Introduction
This document details the code for the manuscript entitled “Association of Gout Polygenic Risk Score with Age at Disease Onset and Tophaceous Disease in European and Polynesian Men with Gout”. This manuscript was submitted to Arthritis and Rheumatology on May 13th 2022 and received reviewers comments on June 14th 2022. We resubmitted the revised manuscript in early September 2022.
Note that you can hide or show the code for a particular section by clicking the “Code” button on the right. If you wish to hide all code you can do so with the “Code” drop-down menu at the top right of the document.
# Loading required libraries.
library(here)
library(vroom)
library(knitr)
library(kableExtra)
library(lubridate)
library(viridis)
library(janitor)
library(sciplot)
library(meta)
library(broom)
library(MASS)
library(gmodels)
library(ez)
library(car)
library(pastecs)
library(qqman)
library(ggrepel)
library(fs)
library(grid)
library(tiff)
library(lm.beta)
library(rsq)
library(ukbtools)
library(tidyverse)
# Setting options for knitr.
knitr::opts_chunk$set(fig.width = 8.5, warning = FALSE, message = FALSE, engine.opts = list(bash = "-l"))
# Setting theme for ggplot.
theme_set(theme_bw())
# Setting scratch path as a variable
scratch_path <- path("/Volumes/scratch/merrimanlab/Nick/PRS")
Running the Gout GWAS in the UK Biobank cohort
We ran a genome-wide association study in the UK Biobank European cohort with gout as the outcome. This was done with a total of 27,287,012 variants (after imputation), and adjusted for age, sex, and the first 40 genetic principal components.
Initially, the full list of variants for each chromosome was used to define regions that contained 100,000 variants (or the remaining variants if the last region of a chromosome). These splits were converted into a series of files (one per chromosome) with the format of “splitnumber chromosomenumber:startcoord-endcoord” for each row (representing each region). The following code was saved as a script called split_chr.sh in /Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474/splits.
# Setting UKBB_DIR variable.
UKBB_DIR=/Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474
# Set i variable to first input (chromosome number).
i=$1
# Split the corresponding mfi file into sets of 100000 rows per file.
split --lines=100000 --suffix-length=5 --numeric-suffixes=1 $UKBB_DIR/ukb_mfi_chr${i}_v3.txt $UKBB_DIR/splits/chr${i}_
# If chr number less than 10, add a 0 before it.
if [[ $i -lt 10 ]]
then
i2=$(echo "0${i}")
else
i2=$i
fi
# Summarize the chr, start, and end coordinates of each file then add them all together as rows of an output file with the row number attached.
for FILE in $UKBB_DIR/splits/chr${i}_*
do echo "${i2}:$(paste -d '-' <(head -1 $FILE | cut -f3) <(tail -1 $FILE | cut -f3))"
done | awk '{print NR, $0}' > $UKBB_DIR/splits/split_chr${i}.txt
# Remove the temporary split files.
rm $UKBB_DIR/splits/chr${i}_*
This script was then run using the following code:
# Setting UKBB_DIR variable.
UKBB_DIR=/Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474
# Running split_chr.sh in parallel for each chromosome.
parallel --progress -P $UKBB_DIR/splits/parallel_process.txt --tmp $UKBB_DIR/splits/tmp/ 'bash {1}/splits/split_chr.sh {2}' ::: $UKBB_DIR ::: {1..22} X XY
Next, we aimed to convert the genotypes from bgen format into vcf.gz for each split of each chromosome as defined above. The following code was saved as a script called extract_convert.sh in /Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474/splits.
# Setting UKBB_DIR variable.
UKBB_DIR=/Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474
# Loading bcftools version 1.10.2.
module load bcftools/bcftools-1.10.2
# Setting the COORD variable as the first argument passed to this script.
COORD=$1
# Setting the SPLIT, RANGE, CHR and CHR_ALT variables based on the COORD variable.
SPLIT=$(echo $COORD | cut -d " " -f1)
RANGE=$(echo $COORD | cut -d " " -f2)
CHR=$(echo $RANGE | cut -d ":" -f1 | sed "s/^0//g")
CHR_ALT=$(echo $RANGE | cut -d ":" -f1)
# Converting UK Biobank bgen files for chromosomes 1 to 22 into vcf.gz files. This script works on a single chromosome and splits the bgen file into multiple vcf.gz files. The input coordinates are in the form "splitnumber chromosomenumber:startcoord-endcoord".
bgenix -g $UKBB_DIR/ukb_imp_chr${CHR}_v3.bgen \
-i $UKBB_DIR/ukb_imp_chr${CHR}_v3.bgen.bgi -vcf -incl-range $RANGE | \
sed "s/^0//g" | \
bcftools reheader -s $UKBB_DIR/bgen_to_vcf/id_convert.txt | \
bcftools reheader -h $UKBB_DIR/splits/new_header.txt | \
bcftools view -S $UKBB_DIR/splits/gwas_keep_ids.txt -O z -o $UKBB_DIR/splits/bgenix_convert_chr${CHR_ALT}_${SPLIT}.vcf.gz
Given that the X chromosome had to be analyzed in a different manner, the following code only applies to the X chromosome splits. This was saved as extract_convert_chrX.sh in /Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474/splits.
# Setting UKBB_DIR variable.
UKBB_DIR=/Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474
# Loading bcftools version 1.10.2.
module load bcftools/bcftools-1.10.2
# Setting the COORD variable as the first argument passed to this script.
COORD=$1
# Setting the SPLIT, RANGE, CHR and CHR_ALT variables based on the COORD variable.
SPLIT=$(echo $COORD | cut -d " " -f1)
RANGE=$(echo $COORD | cut -d " " -f2)
CHR=$(echo $RANGE | cut -d ":" -f1 | sed "s/^0//g")
CHR_ALT=$(echo $RANGE | cut -d ":" -f1)
# Converting UK Biobank bgen files for chromosome X into vcf.gz files. This script works on a single chromosome and splits the bgen file into multiple vcf.gz files. The input coordinates are in the form "splitnumber chromosomenumber:startcoord-endcoord".
bgenix -g $UKBB_DIR/ukb_imp_chr${CHR}_v3.bgen \
-i $UKBB_DIR/ukb_imp_chr${CHR}_v3.bgen.bgi -vcf -incl-range $RANGE | \
sed "s/^0//g" | \
bcftools reheader -s $UKBB_DIR/bgen_to_vcf/id_convert_chrX.txt | \
bcftools reheader -h $UKBB_DIR/splits/new_header_chrX.txt | \
bcftools view -S $UKBB_DIR/splits/gwas_keep_ids.txt --force-samples -O z -o $UKBB_DIR/splits/bgenix_convert_chr${CHR_ALT}_${SPLIT}.vcf.gz
Finally the XY chromosome had to be analyzed in a different manner, thus the following code only applies to the XY chromosome splits. This was saved as extract_convert_chrXY.sh in /Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474/splits.
# Setting UKBB_DIR variable.
UKBB_DIR=/Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474
# Loading bcftools version 1.10.2.
module load bcftools/bcftools-1.10.2
# Setting the COORD variable as the first argument passed to this script.
COORD=$1
# Setting the SPLIT, RANGE, CHR and CHR_ALT variables based on the COORD variable.
SPLIT=$(echo $COORD | cut -d " " -f1)
RANGE=$(echo $COORD | cut -d " " -f2)
CHR=$(echo $RANGE | cut -d ":" -f1 | sed "s/^0//g")
CHR_ALT=$(echo $RANGE | cut -d ":" -f1)
# Converting UK Biobank bgen files for chromosome XY into vcf.gz files. This script works on a single chromosome and splits the bgen file into multiple vcf.gz files. The input coordinates are in the form "splitnumber chromosomenumber:startcoord-endcoord".
bgenix -g $UKBB_DIR/ukb_imp_chr${CHR}_v3.bgen \
-i $UKBB_DIR/ukb_imp_chr${CHR}_v3.bgen.bgi -vcf -incl-range $RANGE | \
sed "s/^0//g" | \
bcftools reheader -s $UKBB_DIR/bgen_to_vcf/id_convert_chrXY.txt | \
bcftools reheader -h $UKBB_DIR/splits/new_header_chrXY.txt | \
bcftools view -S $UKBB_DIR/splits/gwas_keep_ids.txt --force-samples -O z -o $UKBB_DIR/splits/bgenix_convert_chr${CHR_ALT}_${SPLIT}.vcf.gz
The above scripts were run in parallel to produce vcf.gz files for each split of each chromosome using the following code.
# Setting UKBB_DIR variable.
UKBB_DIR=/Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474
# Running extract_convert.sh in parallel for each split of each autosome.
parallel --progress -P $UKBB_DIR/splits/parallel_process.txt --tmp $UKBB_DIR/splits/tmp/ 'bash {1}/splits/extract_convert.sh {2}' ::: $UKBB_DIR :::: $UKBB_DIR/splits/split_chr{1..22}.txt
# Running extract_convert_chrX.sh in parallel for each split of the X chromosome.
parallel --progress -P $UKBB_DIR/splits/parallel_process.txt --tmp $UKBB_DIR/splits/tmp/ 'bash {1}/splits/extract_convert_chrX.sh {2}' ::: $UKBB_DIR :::: $UKBB_DIR/splits/split_chrX.txt
# Running extract_convert_chrXY.sh in parallel for each split of the XY chromosome.
parallel --progress -P $UKBB_DIR/splits/parallel_process.txt --tmp $UKBB_DIR/splits/tmp/ 'bash {1}/splits/extract_convert_chrXY.sh {2}' ::: $UKBB_DIR :::: $UKBB_DIR/splits/split_chrXY.txt
Next, the following code was saved as do_gout_gwas.sh in /Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474/splits.
# Setting UKBB_DIR variable.
UKBB_DIR=/Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474
# Setting the "file" variable based on the first argument passed to the script.
file=$1
# Setting the "mem" variable to 36000, representing 36 Gb of RAM in plink code.
mem=36000
# Converting the file from the vcf.gz format to the PLINK binary format "bed/bim/fam".
plink1.9b6.10 --vcf $UKBB_DIR/splits/${file}.vcf.gz --make-bed --out $UKBB_DIR/splits/${file} --memory $mem
# Creating temporary PLINK fileset which is filtered to keep only individuals with IDs in the gout_gwas_keep_ids_w_sex.txt file, then adding the phenotype information based on the plink_goutaff column of gout_gwas_covar.covar, finally updating the sex variable based on the gout_gwas_keep_ids_w_sex.txt file.
plink1.9b6.10 --bfile $UKBB_DIR/splits/${file} --out $UKBB_DIR/splits/${file}_tmp --keep $UKBB_DIR/splits/gout_gwas_keep_ids_w_sex.txt --pheno $UKBB_DIR/splits/gout_gwas_covar.covar --pheno-name plink_goutaff --update-sex $UKBB_DIR/splits/gout_gwas_keep_ids_w_sex.txt --make-bed --memory $mem
# Running the logistic regression GWAS, additionally outputting a minor allele frequency report, filtering for SNPs with less than 10% missingness, producing a report of missingness, adding 95% confidence intervals to the logistic regression results, filtering variants to exclude those with minor allele frequency of less than 0.0001 (0.01%), filtering to exclude variants that fail the Hardy-Weinberg equilibrium test with p < 0.000001, producing a Hardy-Weinberg report, then providing the covariates (age and the first 40 principal components) from the gout_gwas_covar.covar file.
plink1.9b6.10 --bfile $UKBB_DIR/splits/${file}_tmp --logistic sex --freq case-control --geno 0.1 --missing --ci 0.95 --maf 0.0001 --hwe 0.000001 --hardy --out $UKBB_DIR/splits/gout_gwas/${file} --covar $UKBB_DIR/splits/gout_gwas_covar.covar --covar-name Age,pc1-pc40 --memory $mem
# Deleting all created PLINK binary files.
rm $UKBB_DIR/splits/${file}.{bed,bim,fam} $UKBB_DIR/splits/${file}_tmp.{bed,bim,fam}
# Modifying each of the PLINK reports from running the GWAS function to first convert multiple spaces into a single space, then convert spaces to tabs, then remove leading tabs from each row, then remove trailing tabs from each row, then compress the result and save it as a .tsv.gz file, then also remove the original file.
tr -s ' ' < $UKBB_DIR/splits/gout_gwas/${file}.assoc.logistic | tr ' ' '\t' | sed 's/^\t//g' | sed 's/\t$//g' | gzip -c > $UKBB_DIR/splits/gout_gwas/${file}.assoc.logistic.tsv.gz && rm $UKBB_DIR/splits/gout_gwas/${file}.assoc.logistic
tr -s ' ' < $UKBB_DIR/splits/gout_gwas/${file}.frq.cc | tr ' ' '\t' | sed 's/^\t//g' | sed 's/\t$//g' | gzip -c > $UKBB_DIR/splits/gout_gwas/${file}.frq.cc.tsv.gz && rm $UKBB_DIR/splits/gout_gwas/${file}.frq.cc
tr -s ' ' < $UKBB_DIR/splits/gout_gwas/${file}.hwe | tr ' ' '\t' | sed 's/^\t//g' | sed 's/\t$//g' | gzip -c > $UKBB_DIR/splits/gout_gwas/${file}.hwe.tsv.gz && rm $UKBB_DIR/splits/gout_gwas/${file}.hwe
tr -s ' ' < $UKBB_DIR/splits/gout_gwas/${file}.imiss | tr ' ' '\t' | sed 's/^\t//g' | sed 's/\t$//g' | gzip -c > $UKBB_DIR/splits/gout_gwas/${file}.imiss.tsv.gz && rm $UKBB_DIR/splits/gout_gwas/${file}.imiss
tr -s ' ' < $UKBB_DIR/splits/gout_gwas/${file}.lmiss | tr ' ' '\t' | sed 's/^\t//g' | sed 's/\t$//g' | gzip -c > $UKBB_DIR/splits/gout_gwas/${file}.lmiss.tsv.gz && rm $UKBB_DIR/splits/gout_gwas/${file}.lmiss
The above script was run in parallel to run a GWAS for gout for each split of each chromosome using the following code.
# Setting UKBB_DIR variable.
UKBB_DIR=/Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474
# Run the command 'bash /Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474/splits/do_gout_gwas.sh {2}' in parallel, each time replacing the {2} with the basename of a .vcf.gz file in the /Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474/splits directory. This command will also print the progress of each task, and will run up to 5 jobs in parallel at a time (this number is stored in parallel_process.txt), also send temporary outputs into a subdirectory called tmp/.
parallel --progress -P $UKBB_DIR/splits/parallel_process.txt --tmp $UKBB_DIR/splits/tmp/ 'bash {1}/splits/do_gout_gwas.sh {2}' ::: $UKBB_DIR ::: $(basename -s .vcf.gz -a $(ls $UKBB_DIR/splits/bgenix_convert_chr*.gz))
The gout GWAS output files (*.assoc.logistic.tsv.gz) were then filtered and concatenated together for each chromosome. The following code was saved as concat_gout_gwas.sh in /Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474/splits
# Setting UKBB_DIR variable.
UKBB_DIR=/Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474
# Setting the "i" variable based on the first argument passed to the script.
i=$1
# Concatenating all files together for each chromosome and filtering for SNP effects only (ADD) then sorting numerically by BP position.
for FILE in $UKBB_DIR/splits/gout_gwas/bgenix_convert_chr${i}_*.assoc.logistic.tsv.gz
do zcat $FILE | awk -F '\t' '$5 == "ADD"'
done | sort -nk 3 > $UKBB_DIR/splits/gout_gwas/gout_gwas_chr${i}_add_unfiltered_p.tsv
The above script was run in parallel for each chromosome using the following code.
# Setting UKBB_DIR variable.
UKBB_DIR=/Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474
# Run the command 'bash /Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474/splits/concat_gout_gwas.sh {2}' in parallel, each time replacing the {2} with the chromosome number. This command will also print the progress of each task, and will run up to 5 jobs in parallel at a time, also send temporary outputs into a subdirectory called tmp/.
parallel --progress -P $UKBB_DIR/splits/parallel_process.txt --tmp $UKBB_DIR/splits/tmp/ 'bash {1}/splits/concat_gout_gwas.sh {2}' ::: $UKBB_DIR ::: {1..22} X XY
Finally, these chromosome wide summary stats were concatenated into ukbb_gout-allcontrol_chr1-22.X.XY.add_unfiltered_p.tsv.
# Setting UKBB_DIR variable.
UKBB_DIR=/Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474
# Making empty file with header row.
zcat $UKBB_DIR/splits/gout_gwas/bgenix_convert_chr10_1.assoc.logistic.tsv.gz | head -1 > $UKBB_DIR/splits/gout_gwas/ukbb_gout-allcontrol_chr1-22.X.XY.add_unfiltered_p.tsv
# Concatenating all summary stats files together.
for file in $UKBB_DIR/splits/gout_gwas/gout_gwas_chr*_add_unfiltered_p.tsv
do cat $file >> $UKBB_DIR/splits/GWAS_results/gout/ukbb_gout-allcontrol_chr1-22.X.XY.add_unfiltered_p.tsv
done
Filtering the gout GWAS summary statistics
Next, I filtered the gout GWAS summary statistics to only include SNPs that were genotyped on the CoreExome genotyping chip. Additionally, I removed any indels, variants with MAF < 0.01, and X/Y chromosome SNPs, and ensured that all variants were biallelic.
To determine which SNPs were genotyped in the CoreExome, I first had to remove poorly genotyped variants (greater than 5% missingness). This was done by first extracting the IDs of the individuals to be studied in the CoreExome, as these IDs would be needed for accurately removing poorly genotyped variants.
# Loading phenotype file for CoreExome data.
CoreExPheno <- read_delim(here("Data/Phenotypes/CZ-MB1.2-QC1.10_MergedPhenotypes_20082020.txt"), delim = "\t")
# Loading IDs of all genotyped individuals for the CoreExome.
All_CoreEx_ID <- read_delim("/Volumes/archive/merrimanlab/raid_backup/New_Zealand_Chip_data/CoreExome/QC_MergedBatches/Final_Data/CZ-MB1.2-QC1.10_CoreExome24-1.0-3_genotyped-QCd_rsIDconverted.fam", delim = " ", col_names = F)
# Extracting data on all European IDs from the CoreExome that were genotyped and that were part of the cohorts that we plan on studying, also removing those samples missing gout status.
CoreExPheno_Euro <- CoreExPheno %>%
filter(Geno.BroadAncestry == "European",
Geno.SampleID %in% All_CoreEx_ID$X2,
General.Use != "No",
!Pheno.Study %in% c("Auckland Controls", "Australian Controls", "ESR", "Rheumatoid Arthritis"),
!is.na(Pheno.GoutSummary))
# Extracting data on all Oceanian IDs from the CoreExome that were genotyped and that were part of the cohorts that we plan on studying, also removing those samples missing gout status.
CoreExPheno_Poly <- CoreExPheno %>%
filter(Geno.BroadAncestry == "Oceanian",
Geno.SampleID %in% All_CoreEx_ID$X2,
General.Use != "No",
!Pheno.Study %in% c("ESR", "Pacific Trust"),
!is.na(Pheno.GoutSummary))
# Combining European and Polynesian phenotype files together then extracting FID and IID columns.
all_coreex_ids <- rbind(CoreExPheno_Euro, CoreExPheno_Poly) %>%
select(Geno.FamilyID, Geno.SampleID)
# Writing out FID and IID columns as a txt file for filtering.
write_delim(all_coreex_ids, delim = "\t", file = path(scratch_path, "Output/Temp/all_coreex_ids.txt"), col_names = F)
# Cleaning up environment.
rm(CoreExPheno, CoreExPheno_Euro, CoreExPheno_Poly, all_coreex_ids, All_CoreEx_ID)
Next, we used PLINK to filter the CoreExome genotype file, keeping the list of IDs that we generated above. We then use these IDs to filter out genotypes that are missing in more than 5% of individuals.
# Setting directory as a variable.
PRS_SCRATCH=/Volumes/scratch/merrimanlab/Nick/PRS
# Filtering CoreExome PLINK files to only keep IDs in all_coreex_ids.txt, then filtering out genotypes with more than 5% missingness.
plink1.9b4.9 --bfile /Volumes/archive/merrimanlab/raid_backup/New_Zealand_Chip_data/CoreExome/QC_MergedBatches/Final_Data/CZ-MB1.2-QC1.10_CoreExome24-1.0-3_genotyped-QCd_rsIDconverted \
--keep $PRS_SCRATCH/Output/Temp/all_coreex_ids.txt \
--geno 0.05 \
--make-bed --out $PRS_SCRATCH/Output/Temp/inCoreExGeno
The resulting list of variants was then used to filter the gout GWAS summary statistics prior to reading the summary statistics file into R.
# Setting directory as a variable.
UKBB_DIR=/Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474
PRS_DIR=/Volumes/archive/userdata/student_users/nicksumpter/Documents/PhD/PRS
PRS_SCRATCH=/Volumes/scratch/merrimanlab/Nick/PRS
# Reordering the summary statistics columns to put chr and bp next to each other (and removing useless columns).
cat $UKBB_DIR/splits/GWAS_results/gout/ukbb_gout-allcontrol_chr1-22.X.XY.add_unfiltered_p.tsv | awk -v OFS="\t" '{print $1, $3, $2, $4, $7, $8, $9, $10, $11, $12}' > $PRS_SCRATCH/Data/GWAS/ukbb_gout_sumstat.tsv
# Reordering CoreExome bim file for filtering.
cat $PRS_SCRATCH/Output/Temp/inCoreExGeno.bim | awk -v OFS="\t" '{print $1, $4}' > $PRS_SCRATCH/Output/Temp/for_filtering.txt
# Filtering the summary statistics to only include variants that were genotyped in the CoreExome file (using chromosome and BP location).
cat $PRS_SCRATCH/Data/GWAS/ukbb_gout_sumstat.tsv | grep -wFf $PRS_SCRATCH/Output/Temp/for_filtering.txt > $PRS_SCRATCH/Output/Temp/filtered_ukbb_gout_sumstat.tsv
# Reading in filtered summary statistics for gout GWAS.
sumstat <- vroom(path(scratch_path, "Output/Temp/filtered_ukbb_gout_sumstat.tsv"),
delim = "\t",
col_names = FALSE) %>%
rename(CHR = X1,
BP = X2,
SNP = X3,
A1 = X4,
OR = X5,
SE = X6,
L95 = X7,
U95 = X8,
STAT = X9,
P = X10)
# Performing initial filtering of summary statistics to keep only autosomal variants that remained after filtering for high quality CoreExome genotypes. Additionally, some early filtering for indels was performed by keeping only alleles that had length of 1. Finally cleaning up the SNP column by separating into 2 columns using the comma delimiter. This resulted in 307,368 remaining variants.
sumstat2 <- sumstat %>%
filter(CHR %in% 1:22,
str_length(A1) == 1) %>%
separate(SNP, into = c("SNP1", "SNP2", "SNP3"), sep = ",", remove = FALSE)
# Checking what third column contained (just chromosomes, so can remove).
table(sumstat2$SNP3)
# Removing third SNP column.
sumstat2 <- sumstat2 %>%
select(-SNP3)
# Extracting 307,066 variants with an rsID in the SNP1 column.
rsid_col1 <- sumstat2 %>%
filter(str_detect(SNP1, regex("^rs[0-9]+")))
# Extracting 74,955 variants with an rsID in the SNP1 and SNP2 columns.
rsid_rsid <- rsid_col1 %>%
filter(str_detect(SNP2, regex("^rs[0-9]+")))
# Extracting 63 variants with two different rsIDs, for the most part SNP1 appears to be the newest rsID, but I will keep the extra rsID in a separate column.
different_rsids <- rsid_rsid %>%
filter(SNP1 != SNP2)
# Extracting 231,166 variants with an rsID in the SNP1 column and a SNP ID of the format chr:bp_a1_a2 in the SNP2 column.
rsid_snpid <- rsid_col1 %>%
filter(str_detect(SNP2, regex("^[0-9]+:[0-9]+_[ACGT]+_[ACGT]+$")))
# Splitting SNP2 column into various parts.
rsid_snpid <- rsid_snpid %>%
separate(SNP2, into = c("CHR2", "Extra"), sep = ":", remove = FALSE, convert = TRUE) %>%
separate(Extra, into = c("BP2", "Allele1", "Allele2"), sep = "_", convert = TRUE)
# Showing that the BP and BP2 columns are equal.
sum(rsid_snpid$BP != rsid_snpid$BP2)
# Showing that the CHR and CHR2 columns are equal.
sum(rsid_snpid$CHR != rsid_snpid$CHR2)
# Testing whether the A1 and Allele2 columns are equal shows that they are not always equal.
sum(rsid_snpid$A1 != rsid_snpid$Allele2)
# Removing CHR2 and BP2 columns then further filtering out indels based on Allele1 and Allele2 columns. This removes a further 13 indels resulting in 231,153 variants.
rsid_snpid <- rsid_snpid %>%
select(-CHR2, -BP2) %>%
filter(str_length(Allele1) == 1,
str_length(Allele2) == 1)
# Isolating SNP, Allele1 and Allele2 columns.
rsid_snpid <- rsid_snpid %>%
select(SNP, Allele1, Allele2)
# Filtering to keep 945 variants with neither rsID nor chr:bp_a1_a2 in the SNP2 column.
rsid_other <- rsid_col1 %>%
filter(!str_detect(SNP2, regex("^[0-9]+:[0-9]+_[ACGT]+_[ACGT]+$|^rs[0-9]+$")))
# Showing that all of these remaining SNPs are in the format Affx-<number>.
sum(!str_detect(rsid_other$SNP2, regex("^Affx-[0-9]+$")))
# Making clean final list of variants from those with an rsID in the SNP1 column, ensuring that we remove indels that were removed in the steps above.
rsid_col1_final <- rsid_col1 %>%
mutate(RSID = SNP1,
ALT_RSID = case_when(str_detect(SNP2, regex("^rs[0-9]+$")) & SNP1 != SNP2 ~ SNP2,
TRUE ~ NA_character_),
AFFYID = case_when(str_detect(SNP2, regex("^Affx-[0-9]+$")) ~ SNP2,
TRUE ~ NA_character_),
SNP_ID = case_when(str_detect(SNP2, regex("^[0-9]+:[0-9]+_[ACGT]+_[ACGT]+$")) ~ SNP2,
TRUE ~ NA_character_)) %>%
left_join(rsid_snpid, by = "SNP") %>%
filter(is.na(SNP_ID) | (!is.na(SNP_ID) & !is.na(Allele1)))
rm(rsid_col1, rsid_other, rsid_rsid, rsid_snpid, different_rsids)
# Extracting variants without an rsID in the SNP1 column. 302 don't have an rsID in SNP1 column.
not_rsid_col1 <- sumstat2 %>%
filter(!str_detect(SNP1, regex("^rs[0-9]+")))
# Extracting variants with snp_id format in the SNP1 column. 300 have the snp_id format in SNP1.
snpid_col1 <- not_rsid_col1 %>%
filter(str_detect(SNP1, regex("^[0-9]+:[0-9]+_[ACGT]+_[ACGT]+$")))
# Splitting up the snp_id column.
snpid_col1 <- snpid_col1 %>%
separate(SNP1, into = c("CHR2", "Extra"), sep = ":", remove = FALSE, convert = TRUE) %>%
separate(Extra, into = c("BP2", "Allele1", "Allele2"), sep = "_", convert = TRUE)
# Showing that the BP and BP2 columns are equal.
sum(snpid_col1$BP != snpid_col1$BP2)
# Showing that the CHR and CHR2 columns are equal.
sum(snpid_col1$CHR != snpid_col1$CHR2)
# Testing whether the A1 and Allele2 columns are equal shows that they are not always equal.
sum(snpid_col1$A1 != snpid_col1$Allele2)
# Removing indels (300 down to 288 remaining - 12 removed)
snpid_col1 <- snpid_col1 %>%
select(-CHR2, -BP2) %>%
filter(str_length(Allele1) == 1,
str_length(Allele2) == 1)
# Isolating SNP, Allele1 and Allele2 columns.
snpid_col1_final <- snpid_col1 %>%
select(SNP, Allele1, Allele2)
# Isolating 7 of the 288 variants with an rsID in the SNP2 column.
snpid_rsid <- snpid_col1 %>%
filter(str_detect(SNP2, regex("^rs[0-9]+")))
# Isolating 280 of the 288 variants with a snp_id in the SNP2 column.
snpid_snpid <- snpid_col1 %>%
filter(str_detect(SNP2, regex("^[0-9]+:[0-9]+_[ACGT]+_[ACGT]+$")))
# Showing that all of these variants have identical SNP1 and SNP2 columns.
sum(snpid_snpid$SNP1 != snpid_snpid$SNP2)
# Remaining SNP has an affy ID in SNP2 column.
snpid_affyid <- snpid_col1 %>%
filter(str_detect(SNP2, regex("^Affx-[0-9]+$")))
# Two total SNPs have an affy ID in the SNP1 column.
affyid_col1 <- not_rsid_col1 %>%
filter(str_detect(SNP1, regex("^Affx-[0-9]+$")))
# Both have an affy ID in the SNP2 column.
affyid_affyid <- affyid_col1 %>%
filter(str_detect(SNP2, regex("^Affx-[0-9]+$")))
# Showing that both of these variants have identical SNP1 and SNP2 columns.
sum(affyid_affyid$SNP1 != affyid_affyid$SNP2)
# Creating final list of SNPs without an rsID in column 1, ensuring that we remove indels that were removed in the steps above.
not_rsid_col1_final <- not_rsid_col1 %>%
mutate(RSID = case_when(str_detect(SNP2, regex("^rs[0-9]+$")) ~ SNP2, TRUE ~ NA_character_),
ALT_RSID = NA_character_,
AFFYID = case_when(str_detect(SNP2, regex("^Affx-[0-9]+$")) ~ SNP2, TRUE ~ NA_character_),
SNP_ID = case_when(str_detect(SNP1, regex("^[0-9]+:[0-9]+_[ACGT]+_[ACGT]+$")) ~ SNP1, TRUE ~ NA_character_)) %>%
left_join(snpid_col1_final, by = "SNP") %>%
filter(is.na(SNP_ID) | (!is.na(SNP_ID) & !is.na(Allele1)))
# Combining two SNP lists back together resulting in 307,343 variants
sumstat3 <- rbind(rsid_col1_final, not_rsid_col1_final) %>%
arrange(CHR, BP)
# Cleaning up the environment.
remove <- ls()
remove <- as_tibble(remove) %>%
filter(str_detect(value, "col"))
remove <- remove$value
rm(list = remove, remove)
rm(affyid_affyid, snpid_affyid, snpid_rsid, snpid_snpid)
# Pulling out unique chr/bp coordinates, leaves 306,462 locations, suggesting 881 duplicate locations.
tmp <- sumstat3 %>%
select(CHR, BP) %>%
unique()
# Writing out list of locations for each chromosome.
for(i in 1:22) {
write_delim(select(filter(tmp, CHR == i), BP), file = paste0(path(scratch_path, "Output/Temp"), "/chr", i, "_snplist.txt"), delim = "\n")
}
# Setting directory as a variable.
PRS_SCRATCH=/Volumes/scratch/merrimanlab/Nick/PRS
# Now to pull out the MAF and alleles for all SNPs using the mfi files (based on the SNP location files we just created).
parallel "grep -Fwhf {1}/Output/Temp/chr{2}_snplist.txt /Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474/splits/ukb_mfi_chr{2}_v3.txt > {1}/Output/Temp/ukb_maf_info_chr{2}.txt" ::: $PRS_SCRATCH ::: {1..22}
# Reading back in the filtered mfi files and combining them together (total of 319,376 variants).
out <- tibble()
for(i in 1:22) {
assign(paste0("chr", i, "_snps"), read_delim(path(scratch_path, "Output/Temp", paste0("ukb_maf_info_chr", i, ".txt")), delim = "\t", col_names = FALSE) %>% mutate(CHR = i))
out <- rbind(out, get(paste0("chr", i, "_snps")))
rm(list = paste0("chr", i, "_snps"), i)
}
# Updating the column names of this file.
colnames(out) <- c("SNP1_mfi", "SNP2_mfi", "BP_mfi", "Allele1_mfi", "Allele2_mfi", "MAF_mfi", "Minor_Allele_mfi", "INFO_mfi", "CHR_mfi")
# Showing that all have allele1 and allele2.
sum(is.na(out$Allele1_mfi) | is.na(out$Allele2_mfi))
# Removing any remaining indels, any variants with less than 1% allele frequency in the UK Biobank, and any variants with less than 0.3 INFO scores (indicating poor imputation quality). This results in 246,869 variants.
out <- out %>%
mutate(CHR_BP = paste0(CHR_mfi, "_", BP_mfi)) %>%
filter(Allele1_mfi %in% c("A", "C", "G", "T"),
Allele2_mfi %in% c("A", "C", "G", "T"),
MAF_mfi > 0.01,
MAF_mfi < 0.99,
INFO_mfi > 0.3)
# 224 of these are multi-allelic sites.
sum(duplicated(out$CHR_BP))
# Adding information from above and filtering out missing variants. Results in 246,017 variants.
sumstat4 <- sumstat3 %>%
mutate(CHR_BP = paste0(CHR, "_", BP)) %>%
left_join(out, by = "CHR_BP") %>%
filter(!is.na(Allele1_mfi)) %>%
select(-CHR_mfi, -BP_mfi)
# 823 duplicated sites are in the sumstat4 file.
tmp <- sumstat4 %>%
filter(duplicated(CHR_BP)) %>%
pull(CHR_BP) %>%
unique()
# A total of 244,015 locations have only one SNP according to the mfi files. All remaining variants will be removed from further analysis.
tmp2 <- sumstat4 %>%
filter(!(CHR_BP %in% tmp))
# Reading back in genotype file (bim file).
geno <- read_delim(path(scratch_path, "Output/Temp", "inCoreExGeno.bim"), delim = "\t", col_names = FALSE) %>%
mutate(CHR_BP = paste0(X1, "_", X4))
# Identifying locations that are biallelic in the genotype files (also removing indels for this comparison). Only 243,832 variants remain (183 fewer than tmp2).
geno2 <- geno %>%
filter(CHR_BP %in% tmp2$CHR_BP,
X5 %in% c("A", "C", "G", "T"),
X6 %in% c("A", "C", "G", "T"))
# No further duplicated variants remain.
tmp <- geno2 %>%
filter(duplicated(CHR_BP)) %>%
pull(CHR_BP) %>%
unique()
# Adding SNP ID and genotype information from bim file.
sumstat5 <- tmp2 %>%
left_join(geno2, by = "CHR_BP") %>%
select(-X1, -X3, -X4)
# Cleaning up environment.
rm(geno, geno2, out, tmp2, tmp3, tmp4, tmp)
# Identifying which allele columns are identical.
# A1 column has no missing data.
sum(is.na(sumstat5$A1))
# Allele1_mfi column has no missing data.
sum(is.na(sumstat5$Allele1_mfi))
# Allele2_mfi column has no missing data.
sum(is.na(sumstat5$Allele2_mfi))
# Allele1 column is missing in 47,667 rows.
sum(is.na(sumstat5$Allele1))
# Allele2 column is missing in 47,667 rows.
sum(is.na(sumstat5$Allele2))
# X5 column is missing in 183 rows.
sum(is.na(sumstat5$X5))
# X6 column is missing in 183 rows.
sum(is.na(sumstat5$X6))
# Allele1_mfi is identical to Allele1 but has no missing data, thus it should overwrite Allele1.
test <- sumstat5 %>%
filter(!is.na(Allele1),
Allele1 != Allele1_mfi)
# Updating columns based on this information.
sumstat5 <- sumstat5 %>%
select(-Allele1) %>%
rename(Allele1 = Allele1_mfi)
# Allele2_mfi is almost identical to Allele2 excluding missing data, thus it should overwrite Allele2. However, these 18 rows need to be removed as they don't match suggesting multi-allelic sites remain.
test <- sumstat5 %>%
filter(!is.na(Allele2),
Allele2 != Allele2_mfi)
# Updating columns based on this information and removing 18 mismatching rows.
sumstat5 <- sumstat5 %>%
select(-Allele2) %>%
rename(Allele2 = Allele2_mfi) %>%
filter(!(CHR_BP %in% test$CHR_BP))
# Some variants don't match the alleles between the mfi files and the genotype files (these should therefore be removed).
test <- sumstat5 %>%
filter((Allele1 == X5 | Allele1 == X6) & (Allele2 == X5 | Allele2 == X6))
# Filtering based on above information.
sumstat5 <- sumstat5 %>%
filter((Allele1 == X5 | Allele1 == X6) & (Allele2 == X5 | Allele2 == X6))
# All remaining variants match between the various sources of information.
test <- sumstat5 %>%
filter(A1 == X5 | A1 == X6)
# SNP1 and SNP2_mfi columns are identical.
sum(sumstat5$SNP1 != sumstat5$SNP2_mfi)
# SNP2 and SNP1_mfi columns are almost identical (48 different).
sum(sumstat5$SNP2 != sumstat5$SNP1_mfi)
# These are just the variants with the chromosome attached to the ID, can safely remove the mfi SNP names.
test <- sumstat5 %>%
filter(SNP2 != SNP1_mfi)
# Cleaning up the column names and removing redundant information.
sumstat5 <- sumstat5 %>%
rename(Effect_Allele = A1,
INFO = INFO_mfi,
MAF = MAF_mfi,
Minor_Allele = Minor_Allele_mfi,
BIM_ID = X2) %>%
select(-CHR_BP, -X5, -X6, -SNP1_mfi, -SNP2_mfi)
# In 1,419 cases that minor allele column doesn't match the effect allele.
test <- sumstat5 %>%
filter(Minor_Allele != Effect_Allele)
# All really close to 0.5 MAF, just need to flip OR, L95, and U95 then set Effect_Allele to Minor_Allele column
summary(test$MAF)
# Flipping allele order etc. for these variants.
test <- test %>%
mutate(OR = 1/OR,
tmp = 1/L95,
tmp2 = 1/U95,
L95 = tmp2,
U95 = tmp,
Effect_Allele = Minor_Allele) %>%
rename(EAF = MAF) %>%
select(-tmp, -tmp2, -Minor_Allele)
# Joining back together with remaining variants after cleaning up.
sumstat5_1 <- sumstat5 %>%
filter(Minor_Allele == Effect_Allele) %>%
select(-Minor_Allele) %>%
rename(EAF = MAF) %>%
rbind(test) %>%
arrange(CHR, BP)
# Changing allele columns to be named effect/alternate.
test <- sumstat5_1 %>%
filter(Allele2 == Effect_Allele) %>%
rename(Alternate_Allele = Allele1) %>%
select(CHR, SNP, BP, Effect_Allele, Alternate_Allele, OR:SNP_ID, BIM_ID, EAF:INFO)
# Changing allele columns to be named effect/alternate.
test2 <- sumstat5_1 %>%
filter(Allele1 == Effect_Allele) %>%
rename(Alternate_Allele = Allele2) %>%
select(CHR, SNP, BP, Effect_Allele, Alternate_Allele, OR:SNP_ID, BIM_ID, EAF:INFO)
# Joining back together into final summary statistic file.
sumstat_final <- rbind(test, test2) %>%
arrange(CHR, BP)
# Saving final summary statistic file.
save(sumstat_final, file = path(scratch_path, "Output/sumstat_final.RData"))
# Cleaning up environment.
rm(sumstat, sumstat2, sumstat3, sumstat4, sumstat5, sumstat5_1, test, test2)
To get from these cleaned up summary statistics to the final list of SNPs for the PRS, I did the following:
I filtered out all SNPs with P-values greater than 5e-8.
I took each lead SNP within a 1 Mb window and used these to define 15 crude loci.
This list of lead SNPs was further filtered to only include one lead SNP per full locus.
- The boundaries of these “full loci” were defined based on two consecutive genome-wide significant SNPs being more than 500 kb apart.
Next, SNPs in the UK Biobank BGEN files were extracted if they fit within the boundaries of these “full loci”.
Conditional GWAS were run at each locus, conditioning on the lead SNP.
If there was a significant SNP (P < 5e-8) remaining after conditioning, the original lead SNP and the new lead SNP were used for a subsequent conditional GWAS at this locus.
This was repeated until no more significant SNPs (P < 5e-8) remained at each locus after conditioning.
Locus zooms were plotted for each locus, using both the unconditioned and conditioned GWAS results.
Finally, the resulting list of 19 lead SNPs were saved in a single file ready for conversion to a PRS.
# Defining one SNP per locus.
# Keeping SNPs with P < 5e-8 SNPs and arranging from smallest P to largest P.
sumstat_signif <- sumstat_final %>%
filter(P <= 5e-8) %>%
arrange(P)
# Grouping into loci +- 500 kb of top SNPs.
# Extracting first row of file (most significant SNP).
gout_top <- sumstat_signif %>%
slice(1)
# Removing SNPs within 500 kb window of this lead SNP.
gout2 <- sumstat_signif %>%
filter(!(CHR == gout_top$CHR[1] & BP %in% ((gout_top$BP[1] - 500000):(gout_top$BP[1] + 500000))))
# Continuing this process until no variants remain.
while(nrow(gout2) > 0) {
tmp <- gout2 %>%
slice(1)
gout_top <- rbind(tmp, gout_top)
gout2 <- gout2 %>%
filter(!(CHR == gout_top$CHR[1] & BP %in% ((gout_top$BP[1] - 500000):(gout_top$BP[1] + 500000))))
}
# Arranging output list by chromosome and bp.
gout_top <- gout_top %>%
arrange(CHR, BP)
# Cleaning up environment.
rm(gout2, tmp)
# Finding regions of loci.
# First arranging significant summary stats by chromosome and bp.
sumstat_signif <- sumstat_signif %>%
arrange(CHR, BP)
# Next finding difference in position between each variant.
out <- NA
for(i in 2:nrow(sumstat_signif)) {
if(sumstat_signif$CHR[i] == sumstat_signif$CHR[i - 1]){
out[i] <- sumstat_signif$BP[i] - sumstat_signif$BP[i - 1]
} else {
out[i] <- NA
}
}
# Making column for filtering using differences of less than 500,000 bp between variants.
tmp <- sumstat_signif %>%
mutate(Diff = out,
Diff2 = case_when(Diff < 500000 ~ Diff))
# Keeping first and last significant variant of each locus based on differences in location between consecutive variants.
# Keeping first SNP as this marks the start of the first locus.
out <- sumstat_signif %>% slice(1)
# For all other SNPs, if there is a difference of greater than 500 kb with the previous SNP then extract that SNP and the previous SNP (as these define the boundaries of loci).
for(i in 2:nrow(sumstat_signif)) {
if(is.na(tmp$Diff2[i])){
out <- rbind(out, sumstat_signif %>% slice(i - 1), sumstat_signif %>% slice(i))
}
}
out <- rbind(out, sumstat_signif %>% slice(nrow(sumstat_signif)))
Running the conditional GWAS
# Running the conditional GWAS.
# Splitting up the PLINK files to have one locus per file (saves on computational time).
# Selecting columns of interest from gout_top.
gout_top2 <- gout_top %>%
select(CHR, BP1, BP2, RSID)
# Making files for extracting SNPs from PLINK files.
for(i in 1:nrow(gout_top2)){
# Saving each row as a tmp variable.
tmp <- gout_top2 %>% slice(i)
# Writing out tmp variable as a file for extracting the range of each variant.
write_delim(tmp, file = paste0(path(scratch_path, "Output/Temp"), "/extractrange_", tmp$RSID, ".txt"), delim = "\t", col_names = F)
}
# Making file with list of all chromosomes for bash script below.
write_delim(gout_top %>% select(CHR), file = path(scratch_path, "Output/Temp/all_chr.txt"), col_names = F)
# Making file with list of all SNPs for bash script below.
write_delim(gout_top %>% select(RSID), file = path(scratch_path, "Output/Temp/all_rsid.txt"), col_names = F)
# Making file with list of all split numbers for bash script below.
write_delim(tibble(paste0(0, 0:9)), file = path(scratch_path, "Output/Temp/split_numbers.txt"), col_names = F)
# Setting directory as a variable.
PRS_SCRATCH=/Volumes/scratch/merrimanlab/Nick/PRS
PRS_DIR=/Volumes/archive/userdata/student_users/nicksumpter/Documents/PhD/PRS
# Filtering PLINK files to keep SNPs of interest for each conditional GWAS.
parallel --xapply "plink1.9b6.10 --bfile {1}/Output/Temp/chr{2}_tmp --extract range {1}/Output/Temp/extractrange_{3}.txt --make-bed --out {1}/Output/Temp/{3}" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/all_chr.txt :::: $PRS_SCRATCH/Output/Temp/all_rsid.txt
# Cutting out list of SNPs from bim file then splitting the list into 10 parts.
parallel "cat {1}/Output/Temp/{2}.bim | cut -f 2 > {1}/Output/Temp/{2}_snps; split -d -n l/10 {1}/Output/Temp/{2}_snps {1}/Output/Temp/{2}_snps_split" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/all_rsid.txt
# Concatenating the conditioning SNP of interest onto each split file.
parallel "echo {2} >> {1}/Output/Temp/{2}_snps_split{3}" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/all_rsid.txt :::: $PRS_SCRATCH/Output/Temp/split_numbers.txt
# De-duplicating each split file.
parallel "cat {1}/Output/Temp/{2}_snps_split{3} | sort -u -o {1}/Output/Temp/{2}_snps_split{3}" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/all_rsid.txt :::: $PRS_SCRATCH/Output/Temp/split_numbers.txt
# Running conditional GWAS's in parallel.
parallel "plink1.9b6.10 --bfile {1}/Output/Temp/{2} --extract {1}/Output/Temp/{2}_snps_split{3} --logistic sex --ci 0.95 --covar {4}/Data/GWAS/gout_gwas_covar.covar --covar-name Age,pc1-pc40 --condition {2} --out {1}/Output/Temp/{2}_split{3}" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/all_rsid.txt :::: $PRS_SCRATCH/Output/Temp/split_numbers.txt ::: $PRS_DIR
# Listing file names of all the newly generated GWAS summary statistics.
file_names <- list.files(path(scratch_path, "Output/Temp/"))[str_detect(list.files(path(scratch_path, "Output/Temp/")), regex("rs[0-9]+_split[0-9]+.assoc.logistic"))]
# Reading all the GWAS summary statistics back into R.
for(i in file_names){
assign(i, read.table(paste0(path(scratch_path, "Output/Temp"), "/", i), header = T) %>% filter(TEST == "ADD"))
}
# Combining the summary statistics for each split together to give a complete summary stat file for each SNP. Also extracting the new lead SNP and adding a column to the unconditioned summary stats.
# Making temporary empty vector called tmp.
tmp <- c()
# Initiating for loop over each row of the gout_top table (i.e. for each of the 15 loci).
for(i in 1:nrow(gout_top)){
# Making temporary empty vector called tmp3.
tmp3 <- c()
# Initiating nested for loop over the values from 0 to 9.
for(j in 0:9){
# Save summary statistic for the current SNP and current split number as tmp2.
tmp2 <- get(paste0(gout_top$RSID[i], "_split0", j, ".assoc.logistic"))
# Concatenate each summary statistic together into tmp3.
tmp3 <- rbind(tmp3, tmp2)
}
# Remove any rows containing NA's (i.e. those conditioned on themselves) and save as <rsID>_gwas.
assign(paste0(gout_top$RSID[i], "_gwas"), tmp3 %>% na.omit())
# Extract the most significant SNP after conditioning along with its p-value.
tmp3 <- tmp3 %>% select(SNP, P) %>% arrange(P) %>% slice(1)
# Add the new lead SNP to the tmp table.
tmp <- rbind(tmp, tmp3)
}
# Renaming columns.
tmp <- tmp %>%
rename(new_lead = SNP, new_p = P)
# Adding new lead SNP and P-value columns.
gout_top2 <- gout_top %>%
cbind(tmp)
# Filtering to only keep SNPs that are still significant at P < 5e-8.
gout_top_resid <- gout_top2 %>%
filter(new_p < 5e-8)
# Cleaning up environment.
rm(list = ls()[str_detect(ls(), ".assoc")], i, tmp, tmp2, file_names, tmp3, j)
# Second round of conditioning.
# Making file with list of all remaining SNPs for bash script below.
write_delim(gout_top_resid %>% select(RSID), file = path(scratch_path, "Output/Temp/round1_rsid.txt"), col_names = F)
# Making file with list of all new lead SNPs for bash script below.
write_delim(gout_top_resid %>% select(new_lead), file = path(scratch_path, "Output/Temp/round1_rsid2.txt"), col_names = F)
# Setting directory as a variable.
PRS_SCRATCH=/Volumes/scratch/merrimanlab/Nick/PRS
PRS_DIR=/Volumes/archive/userdata/student_users/nicksumpter/Documents/PhD/PRS
# Concatenating the conditioning SNP of interest onto each split file.
parallel "echo {4} >> {1}/Output/Temp/{2}_snps_split{3}" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/round1_rsid.txt :::: $PRS_SCRATCH/Output/Temp/split_numbers.txt :::: $PRS_SCRATCH/Output/Temp/round1_rsid2.txt
# De-duplicating each split file.
parallel "cat {1}/Output/Temp/{2}_snps_split{3} | sort -u -o {1}/Output/Temp/{2}_snps_split{3}" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/round1_rsid.txt :::: $PRS_SCRATCH/Output/Temp/split_numbers.txt
# Writing out SNP IDs into a series of files for conditioning next round of GWAS.
parallel --xapply "echo $'{2}\n{3}' > {1}/Output/Temp/{2}_2" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/round1_rsid.txt :::: $PRS_SCRATCH/Output/Temp/round1_rsid2.txt
# Running second round of conditional GWAS's in parallel.
parallel "plink1.9b6.10 --bfile {1}/Output/Temp/{2} --extract {1}/Output/Temp/{2}_snps_split{3} --logistic sex --ci 0.95 --covar {4}/Data/GWAS/gout_gwas_covar.covar --covar-name Age,pc1-pc40 --condition-list {1}/Output/Temp/{2}_2 --out {1}/Output/Temp/{2}_split{3}_2" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/round1_rsid.txt :::: $PRS_SCRATCH/Output/Temp/split_numbers.txt ::: $PRS_DIR
# Listing file names of all the newly generated GWAS summary statistics.
file_names <- list.files(path(scratch_path, "Output/Temp/"))[str_detect(list.files(path(scratch_path, "Output/Temp/")), regex("rs[0-9]+_split[0-9]+_2.assoc.logistic"))]
# Reading all the new GWAS summary statistics back into R.
for(i in file_names){
assign(i, read.table(paste0(path(scratch_path, "Output/Temp"), "/", i), header = T) %>% filter(TEST == "ADD"))
}
# Combining the summary statistics for each split together to give a complete summary stat file for each SNP. Also extracting the new lead SNP and adding a column to the summary stats.
# Making temporary empty vector called tmp.
tmp <- c()
# Initiating for loop over each row of the gout_top_resid table (i.e. for each of the 3 loci).
for(i in 1:nrow(gout_top_resid)){
# Making temporary empty vector called tmp3.
tmp3 <- c()
# Initiating nested for loop over the values from 0 to 9.
for(j in 0:9){
# Save summary statistic for the current SNP and current split number as tmp2.
tmp2 <- get(paste0(gout_top_resid$RSID[i], "_split0", j, "_2.assoc.logistic"))
# Concatenate each summary statistic together into tmp3.
tmp3 <- rbind(tmp3, tmp2)
}
# Remove any rows containing NA's (i.e. those conditioned on themselves) and save as <rsID>_gwas2.
assign(paste0(gout_top_resid$RSID[i], "_gwas2"), tmp3 %>% na.omit())
# Extract the most significant SNP after conditioning along with its p-value.
tmp3 <- tmp3 %>% select(SNP, P) %>% arrange(P) %>% slice(1)
# Add the new lead SNP to the tmp table.
tmp <- rbind(tmp, tmp3)
}
# Renaming columns.
tmp <- tmp %>%
rename(new_lead2 = SNP, new_p2 = P)
# Adding new lead SNP and P-value columns.
gout_top3 <- gout_top_resid %>%
cbind(tmp)
# Filtering to only keep SNPs that are still significant at P < 5e-8.
gout_top_resid2 <- gout_top3 %>%
filter(new_p2 < 5e-8)
# Cleaning up environment.
rm(list = ls()[str_detect(ls(), ".assoc")], i, tmp, tmp2, file_names, tmp3, j)
# Third round of conditioning
# Making file with list of all remaining SNPs for bash script below.
write_delim(gout_top_resid2 %>% select(RSID), file = path(scratch_path, "Output/Temp/round2_rsid.txt"), col_names = F)
# Making file with list of all previous lead SNPs for bash script below.
write_delim(gout_top_resid2 %>% select(new_lead), file = path(scratch_path, "Output/Temp/round2_rsid2.txt"), col_names = F)
# Making file with list of all new lead SNPs for bash script below.
write_delim(gout_top_resid2 %>% select(new_lead2), file = path(scratch_path, "Output/Temp/round2_rsid3.txt"), col_names = F)
# Setting directory as a variable.
PRS_SCRATCH=/Volumes/scratch/merrimanlab/Nick/PRS
PRS_DIR=/Volumes/archive/userdata/student_users/nicksumpter/Documents/PhD/PRS
# Concatenating the conditioning SNP of interest onto each split file.
parallel "echo {4} >> {1}/Output/Temp/{2}_snps_split{3}" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/round2_rsid.txt :::: $PRS_SCRATCH/Output/Temp/split_numbers.txt :::: $PRS_SCRATCH/Output/Temp/round2_rsid3.txt
# De-duplicating each split file.
parallel "cat {1}/Output/Temp/{2}_snps_split{3} | sort -u -o {1}/Output/Temp/{2}_snps_split{3}" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/round1_rsid.txt :::: $PRS_SCRATCH/Output/Temp/split_numbers.txt
# Writing out SNP IDs into a series of files for conditioning next round of GWAS.
parallel --xapply "echo $'{2}\n{3}\n{4}' > {1}/Output/Temp/{2}_3" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/round2_rsid.txt :::: $PRS_SCRATCH/Output/Temp/round2_rsid2.txt :::: $PRS_SCRATCH/Output/Temp/round2_rsid3.txt
# Running third round of conditional GWAS's in parallel.
parallel "plink1.9b6.10 --bfile {1}/Output/Temp/{2} --extract {1}/Output/Temp/{2}_snps_split{3} --logistic sex --ci 0.95 --covar {4}/Data/GWAS/gout_gwas_covar.covar --covar-name Age,pc1-pc40 --condition-list {1}/Output/Temp/{2}_3 --out {1}/Output/Temp/{2}_split{3}_3" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/round2_rsid.txt :::: $PRS_SCRATCH/Output/Temp/split_numbers.txt ::: $PRS_DIR
# Listing file names of all the newly generated GWAS summary statistics.
file_names <- list.files(path(scratch_path, "Output/Temp/"))[str_detect(list.files(path(scratch_path, "Output/Temp/")), regex("rs[0-9]+_split[0-9]+_3.assoc.logistic"))]
# Reading all the new GWAS summary statistics back into R.
for(i in file_names){
assign(i, read.table(paste0(path(scratch_path, "Output/Temp"), "/", i), header = T) %>% filter(TEST == "ADD"))
}
# Combining the summary statistics for each split together to give a complete summary stat file for each SNP. Also extracting the new lead SNP and adding a column to the summary stats.
# Making temporary empty vector called tmp.
tmp <- c()
# Making temporary empty vector called tmp3.
tmp3 <- c()
# Initiating for loop over the values from 0 to 9.
for(j in 0:9){
# Save summary statistic for the current SNP and current split number as tmp2.
tmp2 <- get(paste0(gout_top_resid2$RSID, "_split0", j, "_3.assoc.logistic"))
# Concatenate each summary statistic together into tmp3.
tmp3 <- rbind(tmp3, tmp2)
}
# Remove any rows containing NA's (i.e. those conditioned on themselves) and save as <rsID>_gwas3.
assign(paste0(gout_top_resid2$RSID, "_gwas3"), tmp3 %>% na.omit())
# Extract the most significant SNP after conditioning along with its p-value.
tmp3 <- tmp3 %>% select(SNP, P) %>% arrange(P) %>% slice(1)
# Add the new lead SNP to the tmp table.
tmp <- rbind(tmp, tmp3)
# Renaming columns.
tmp <- tmp %>%
rename(new_lead3 = SNP, new_p3 = P)
# Adding new lead SNP and P-value columns.
gout_top4 <- gout_top_resid2 %>%
cbind(tmp)
# Filtering to only keep SNP's that are still significant at P < 5e-8.
gout_top_resid3 <- gout_top4 %>%
filter(new_p3 < 5e-8)
# Cleaning up environment.
rm(list = ls()[str_detect(ls(), ".assoc")], i, tmp, tmp2, file_names, tmp3, j)
Producing locus zooms for gout GWAS
# Loading in code and gene list. This was cloned from the https://github.com/Geeketics/LocusZooms repository.
source(here("Script/Functions/locus_zoom.R"))
# Loading in the UCSC gene list for annotating the locus zooms.
UCSC_GRCh37_Genes_UniqueList.txt <- read.delim(here("Data/GWAS/UCSC_GRCh37_Genes_UniqueList.txt"))
# Setting directory as a variable.
PRS_SCRATCH=/Volumes/scratch/merrimanlab/Nick/PRS
# Calculating LD for each original GWAS lead SNP vs all other SNP's in that chromosome.
parallel --xapply "plink1.9b6.10 --bfile {1}/Output/Temp/chr{2}_tmp --r2 inter-chr --ld-snp {3} --ld-window-r2 0 --out {1}/Output/Temp/chr{2}_{3}_ld" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/all_chr.txt :::: $PRS_SCRATCH/Output/Temp/all_rsid.txt
# Reading the LD reports back into R.
for(i in 1:nrow(gout_top)){
assign(paste0("chr", gout_top$CHR[i], "_", gout_top$RSID[i], "_ld"), read_table(paste0(path(scratch_path, "Output/Temp"), "/chr", gout_top$CHR[i], "_", gout_top$RSID[i], "_ld.ld")))
}
# Extracting new lead SNP IDs from first round of conditioning.
first_round <- gout_top_resid %>%
select(new_lead) %>%
rename(RSID = new_lead)
# Extracting new lead SNP IDs from second round of conditioning.
second_round <- gout_top_resid2 %>%
select(new_lead2) %>%
rename(RSID = new_lead2)
# Making full list of lead SNPs (including conditionally associated SNPs) and re-adding metadata.
gout_top_full <- gout_top %>%
select(RSID) %>%
rbind(first_round, second_round) %>%
left_join(loci, by = "RSID") %>%
arrange(CHR, BP)
# Plotting the locus zooms of the initial unconditioned GWAS.
# For each SNP in gout_top (15 SNPs total).
for(i in 1:nrow(gout_top)){
# Make temporary data file for locus zoom function.
tmp <- loci %>%
mutate(SNP = RSID) %>%
filter(!is.na(SNP), CHR == gout_top$CHR[i] & between(BP, gout_top$BP1[i], gout_top$BP2[i]))
# Plot locus zoom for the entirety of each locus, with no offset, using the ld files we just created, using the gene list we read in, naming the plot "Unconditioned <RSID> Locus Zoom", saving the plot as a jpg file of the form "Chr<chrnum>_<bp1>_<bp2>_<rsid>_unconditioned.jpg", also labeling all lead SNPs at that locus.
locus.zoom(data = tmp,
region = c(gout_top$CHR[i], gout_top$BP1[i], gout_top$BP2[i]),
offset_bp = 0,
ld.file = get(paste0("chr", gout_top$CHR[i], "_", gout_top$RSID[i], "_ld")),
genes.data = UCSC_GRCh37_Genes_UniqueList.txt,
plot.title = paste0("Unconditioned ", gout_top$RSID[i], " Locus Zoom"),
file.name = paste0(here("Output/Plots/"), "Chr", gout_top$CHR[i], "_", gout_top$BP1[i], "_", gout_top$BP2[i], "_", gout_top$RSID[i], "_unconditioned", ".jpg"),
secondary.snp = gout_top_full$RSID,
secondary.label = TRUE)
}
# Plotting the locus zooms of the first round of conditioning.
# Making file with list of all SNPs in new_lead column for bash script below.
write_delim(gout_top2 %>% select(new_lead), file = path(scratch_path, "Output/Temp/round1_all_rsid.txt"), col_names = F)
# Setting directory as a variable.
PRS_SCRATCH=/Volumes/scratch/merrimanlab/Nick/PRS
# Calculating LD for each newly conditioned GWAS lead SNP vs all other SNPs in that chromosome.
parallel --xapply "plink1.9b6.10 --bfile {1}/Output/Temp/chr{2}_tmp --r2 inter-chr --ld-snp {3} --ld-window-r2 0 --out {1}/Output/Temp/chr{2}_{3}_ld" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/all_chr.txt :::: $PRS_SCRATCH/Output/Temp/round1_all_rsid.txt
# Reading the LD reports back into R.
for(i in 1:nrow(gout_top2)){
assign(paste0("chr", gout_top2$CHR[i], "_", gout_top2$new_lead[i], "_ld"), read_table(paste0(path(scratch_path, "Output/Temp"), "/chr", gout_top2$CHR[i], "_", gout_top2$new_lead[i], "_ld.ld")))
}
# For each SNP in gout_top2 (15 SNPs total).
for(i in 1:nrow(gout_top2)){
# Plot locus zoom for the entirety of each locus, with no offset, using the ld files we just created, using the gene list we read in, naming the plot "Conditioned on <RSID>", saving the plot as a jpg file of the form "Chr<chrnum>_<bp1>_<bp2>_<rsid>_condition_<rsid>.jpg", also labeling all lead SNPs at that locus.
locus.zoom(data = get(paste0(gout_top2$RSID[i], "_gwas")),
region = c(gout_top$CHR[i], gout_top2$BP1[i], gout_top2$BP2[i]),
offset_bp = 0,
ld.file = get(paste0("chr", gout_top2$CHR[i], "_", gout_top2$new_lead[i], "_ld")),
genes.data = UCSC_GRCh37_Genes_UniqueList.txt,
plot.title = paste0("Conditioned on ", gout_top2$RSID[i]),
file.name = paste0(here("Output/Plots/"), "Chr", gout_top2$CHR[i], "_", gout_top2$BP1[i], "_", gout_top2$BP2[i], "_", gout_top2$RSID[i], "_condition_", gout_top2$RSID[i], ".jpg"),
secondary.snp = gout_top_full$RSID,
secondary.label = TRUE)
}
# Plotting locus zooms of second round of conditioning.
# Making file with list of all SNPs in CHR column for bash script below.
write_delim(gout_top3 %>% select(CHR), file = path(scratch_path, "Output/Temp/round2_all_chr.txt"), col_names = F)
# Making file with list of all SNPs in new_lead column for bash script below.
write_delim(gout_top3 %>% select(new_lead2), file = path(scratch_path, "Output/Temp/round2_all_rsid.txt"), col_names = F)
# Setting directory as a variable.
PRS_SCRATCH=/Volumes/scratch/merrimanlab/Nick/PRS
# Calculating LD for each newly conditioned GWAS lead SNP vs all other SNPs in that chromosome.
parallel --xapply "plink1.9b6.10 --bfile {1}/Output/Temp/chr{2}_tmp --r2 inter-chr --ld-snp {3} --ld-window-r2 0 --out {1}/Output/Temp/chr{2}_{3}_ld" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/round2_all_chr.txt :::: $PRS_SCRATCH/Output/Temp/round2_all_rsid.txt
# Reading the LD reports back into R.
for(i in 1:nrow(gout_top3)){
assign(paste0("chr", gout_top3$CHR[i], "_", gout_top3$new_lead2[i], "_ld"), read_table(paste0(path(scratch_path, "Output/Temp"), "/chr", gout_top3$CHR[i], "_", gout_top3$new_lead2[i], "_ld.ld")))
}
# For each SNP in gout_top3 (3 SNPs total).
for(i in 1:nrow(gout_top3)){
# Plot locus zoom for the entirety of each locus, with no offset, using the ld files we just created, using the gene list we read in, naming the plot "Conditioned on <RSID> and <RSID>", saving the plot as a jpg file of the form "Chr<chrnum>_<bp1>_<bp2>_<rsid>_condition_<rsid>and<rsid>.jpg", also labeling all lead SNPs at that locus.
locus.zoom(data = get(paste0(gout_top3$RSID[i], "_gwas2")),
region = c(gout_top3$CHR[i], gout_top3$BP1[i], gout_top3$BP2[i]),
offset_bp = 0,
ld.file = get(paste0("chr", gout_top3$CHR[i], "_", gout_top3$new_lead2[i], "_ld")),
genes.data = UCSC_GRCh37_Genes_UniqueList.txt,
plot.title = paste0("Conditioned on ", gout_top3$RSID[i], " and ", gout_top3$new_lead[i]),
file.name = paste0(here("Output/Plots/"), "Chr", gout_top3$CHR[i], "_", gout_top3$BP1[i], "_", gout_top3$BP2[i], "_", gout_top3$RSID[i], "_condition_", gout_top3$RSID[i], "and", gout_top3$new_lead[i], ".jpg"),
secondary.snp = gout_top_full$RSID,
secondary.label = TRUE)
}
# Plotting locus zooms of third round of conditioning.
# Making file with list of all SNPs in CHR column for bash script below.
write_delim(gout_top4 %>% select(CHR), file = path(scratch_path, "Output/Temp/round3_all_chr.txt"), col_names = F)
# Making file with list of all SNPs in new_lead column for bash script below.
write_delim(gout_top4 %>% select(new_lead3), file = path(scratch_path, "Output/Temp/round3_all_rsid.txt"), col_names = F)
# Setting directory as a variable.
PRS_SCRATCH=/Volumes/scratch/merrimanlab/Nick/PRS
# Calculating LD for each newly conditioned GWAS lead SNP vs all other SNPs in that chromosome.
parallel --xapply "plink1.9b6.10 --bfile {1}/Output/Temp/chr{2}_tmp --r2 inter-chr --ld-snp {3} --ld-window-r2 0 --out {1}/Output/Temp/chr{2}_{3}_ld" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/round3_all_chr.txt :::: $PRS_SCRATCH/Output/Temp/round3_all_rsid.txt
# Reading the LD reports back into R.
for(i in 1:nrow(gout_top4)){
assign(paste0("chr", gout_top4$CHR[i], "_", gout_top4$new_lead3[i], "_ld"), read_table(paste0(path(scratch_path, "Output/Temp"), "/chr", gout_top4$CHR[i], "_", gout_top4$new_lead3[i], "_ld.ld")))
}
# Plotting locus zooms
# For each SNP in gout_top4 (1 SNP total).
for(i in 1:nrow(gout_top4)){
# Plot locus zoom for the entirety of the locus, with no offset, using the ld file we just created, using the gene list we read in, naming the plot "Conditioned on <RSID> and <RSID> and <RSID>", saving the plot as a jpg file of the form "Chr<chrnum>_<bp1>_<bp2>_<rsid>_condition_<rsid>and<rsid>and<rsid>.jpg", also labeling all lead SNPs at that locus.
locus.zoom(data = get(paste0(gout_top4$RSID[i], "_gwas3")),
region = c(gout_top4$CHR[i], gout_top4$BP1[i], gout_top4$BP2[i]),
offset_bp = 0,
ld.file = get(paste0("chr", gout_top4$CHR[i], "_", gout_top4$new_lead3[i], "_ld")),
genes.data = UCSC_GRCh37_Genes_UniqueList.txt,
plot.title = paste0("Conditioned on ", gout_top4$RSID[i], " and ", gout_top4$new_lead[i], " and ", gout_top4$new_lead2[i]),
file.name = paste0(here("Output/Plots/"), "Chr", gout_top4$CHR[i], "_", gout_top4$BP1[i], "_", gout_top4$BP2[i], "_", gout_top4$RSID[i], "_condition_", gout_top4$RSID[i], "and", gout_top4$new_lead[i], "and", gout_top4$new_lead2[i], ".jpg"),
secondary.snp = gout_top_full$RSID,
secondary.label = TRUE)
}
Combining all GWAS results together into final list.
# Extracting regions of all loci.
regions <- gout_top %>%
select(CHR, BP1, BP2)
# Making empty vector named out.
out <- c()
# For each row in regions (i.e. each of 15 loci).
for(i in 1:nrow(regions)){
# Adding the start and end coordinates of the corresponding region.
tmp <- gout_top_full %>%
filter(CHR == regions$CHR[i] & between(BP, regions$BP1[i], regions$BP2[i])) %>%
mutate(BP1 = regions$BP1[i],
BP2 = regions$BP2[i])
# Adding to the out vector.
out <- rbind(out, tmp)
}
# Save out as gout_top_full.
gout_top_full <- out
# For each of the loci with multiple SNPs, testing the association of each SNP after adjusting for all others at the locus.
# Need to write it such that it asks: for every locus with more than one SNP after conditioning, run an association test for each SNP conditioned on all other SNPs at that locus.
# Making empty vector.
out <- c()
# Looping over each duplicated locus start site.
for(i in unique(gout_top_full$BP1[duplicated(gout_top_full$BP1)])){
# Extracting all lead variants at a locus.
assign(paste0(i, "_locus"), gout_top_full %>% filter(BP1 == i) %>% rowid_to_column() %>% select(CHR, RSID, BP1, rowid))
# Adding to out table.
out <- rbind(out, get(paste0(i, "_locus")))
# Extracting the RSID list for the full locus.
tmp <- get(paste0(i, "_locus")) %>%
select(RSID)
# Writing out the RSIDs for the full locus in a file.
write_delim(tmp, file = paste0(path(scratch_path, "Output/Temp"), "/locus_", i, "_snps.txt"), delim = "\n", col_names = F)
# Looping over each row of the locus.
for(j in 1:NROW(get(paste0(i, "_locus")))){
# Making temporary list of SNPs that exclude the SNP in that row of the locus table.
tmp <- get(paste0(i, "_locus")) %>%
select(RSID) %>%
slice(-j)
# Writing out this list of variants.
write_delim(tmp, file = paste0(path(scratch_path, "Output/Temp"), "/locus_", i, "_snps_", j, ".txt"), delim = "\n", col_names = F)
}
}
# Writing out as CHR file.
write_delim(out %>% select(CHR), file = path(scratch_path, "Output/Temp/locus_chrs.txt"), delim = "\n", col_names = F)
# Writing out as BP1 file.
write_delim(out %>% select(BP1), file = path(scratch_path, "Output/Temp/locus_ids.txt"), delim = "\n", col_names = F)
# Writing out as BP1 file.
write_delim(out %>% select(rowid), file = path(scratch_path, "Output/Temp/locus_iteration.txt"), delim = "\n", col_names = F)
# Extracting CHR and BP1 for each unique locus.
tmp <- out %>%
select(CHR, BP1) %>%
filter(duplicated(BP1)) %>%
unique()
# Writing out as CHR file.
write_delim(tmp %>% select(CHR), file = path(scratch_path, "Output/Temp/locus_chr.txt"), delim = "\n", col_names = F)
# Writing out as BP1 file.
write_delim(tmp %>% select(BP1), file = path(scratch_path, "Output/Temp/locus_id.txt"), delim = "\n", col_names = F)
# Setting directory as a variable.
PRS_SCRATCH=/Volumes/scratch/merrimanlab/Nick/PRS
PRS_DIR=/Volumes/archive/userdata/student_users/nicksumpter/Documents/PhD/PRS
# Extracting SNPs of interest for each chromosome of interest.
parallel --xapply "plink1.9b6.10 --bfile {1}/Output/Temp/chr{2}_tmp --extract {1}/Output/Temp/locus_{3}_snps.txt --make-bed --out {1}/Output/Temp/chr{2}_locus{3}" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/locus_chr.txt :::: $PRS_SCRATCH/Output/Temp/locus_id.txt
# Running conditioned GWAS for each locus.
parallel --xapply "plink1.9b6.10 --bfile {1}/Output/Temp/chr{2}_locus{3} --logistic sex --ci 0.95 --covar {5}/Data/GWAS/gout_gwas_covar.covar --covar-name Age,pc1-pc40 --condition-list {1}/Output/Temp/locus_{3}_snps_{4}.txt --out {1}/Output/Temp/final_gwas_{3}_{4}" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/locus_chrs.txt :::: $PRS_SCRATCH/Output/Temp/locus_ids.txt :::: $PRS_SCRATCH/Output/Temp/locus_iteration.txt ::: $PRS_DIR
# Getting file names for all created final GWAS tables.
file_names <- list.files(path(scratch_path, "Output/Temp/"))[str_detect(list.files(path(scratch_path, "Output/Temp/")), "final_gwas_.+logistic")]
# Reading in GWAS summary stats and combining into a single table.
out <- c()
for(i in file_names){
assign(i, read.table(paste0(path(scratch_path, "Output/Temp"), "/", i), header = T) %>% filter(TEST == "ADD") %>% na.omit())
out <- rbind(out, get(i))
}
# Keeping columns of interest.
out <- out %>%
select(CHR, BP, OR:U95, P)
# Merging new SNP effects into full list of variants for PRS.
gout_top_final <- gout_top_full %>%
left_join(out, by = c("CHR", "BP")) %>%
mutate(OR = case_when(is.na(OR.y) ~ OR.x,
TRUE ~ OR.y),
OR_old = case_when(!is.na(OR.y) ~ OR.x),
SE = case_when(is.na(SE.y) ~ SE.x,
TRUE ~ SE.y),
SE_old = case_when(!is.na(SE.y) ~ SE.x),
L95 = case_when(is.na(L95.y) ~ L95.x,
TRUE ~ L95.y),
L95_old = case_when(!is.na(L95.y) ~ L95.x),
U95 = case_when(is.na(U95.y) ~ U95.x,
TRUE ~ U95.y),
U95_old = case_when(!is.na(U95.y) ~ U95.x),
P = case_when(is.na(P.y) ~ P.x,
TRUE ~ P.y),
P_old = case_when(!is.na(P.y) ~ P.x)) %>%
select(CHR, RSID, BP:Alternate_Allele, OR, SE, L95, U95, P, OR_old, SE_old, L95_old, U95_old, P_old, EAF, INFO, BP1, BP2)
# Flipping allele order, OR, L95, U95, and EAF for each variant with an OR under 1.
smallOR <- gout_top_final %>%
filter(OR < 1) %>%
mutate(OR = as.numeric(signif(1/OR, digits = 4)),
tmp_L = as.numeric(signif(1/L95, digits = 4)),
tmp_U = as.numeric(signif(1/U95, digits = 4)),
U95 = tmp_L,
L95 = tmp_U,
OR_old = as.numeric(round(1/OR_old, digits = 3)),
tmp_L_old = as.numeric(round(1/L95_old, digits = 3)),
tmp_U_old = as.numeric(round(1/U95_old, digits = 3)),
U95_old = tmp_L_old,
L95_old = tmp_U_old,
EAF = 1 - EAF) %>%
rename(allele2 = Effect_Allele,
allele1 = Alternate_Allele) %>%
rename(Alternate_Allele = allele2,
Effect_Allele = allele1) %>%
select(CHR:BP, Effect_Allele, Alternate_Allele, OR:BP2)
bigOR <- gout_top_final %>%
filter(OR > 1)
# Combining together both lists above.
gout_top_final <- full_join(smallOR, bigOR) %>%
arrange(CHR, BP)
# Labeling each locus based on the nearest gene in the UCSC gene list.
# Preparing the gene list (removing rRNA genes).
genelist <- UCSC_GRCh37_Genes_UniqueList.txt %>%
select(Chrom, Start, End, Gene) %>%
filter(!str_detect(Gene, "rRNA"))
# Making empty vector.
out <- c()
# Looping over each lead variant.
for(i in 1:nrow(gout_top_final)){
# Making temporary table filtered to keep only genes that directly overlap with the variant.
tmp <- genelist %>%
filter(Chrom == gout_top_final$CHR[i],
Start < gout_top_final$BP[i],
End > gout_top_final$BP[i])
# Making merged gene name if the variant overlaps more than one gene.
if(nrow(tmp) > 1){
tmp <- tmp %>%
mutate(Gene = paste(Gene, sep = "|")) %>%
slice(1)
}
# If the variant doesn't overlap a gene, extracting the nearest gene based on start/end coordinates.
if(nrow(tmp) == 0){
tmp <- genelist %>%
mutate(Diff_Start = abs(Start - gout_top_final$BP[i]),
Diff_End = abs(End - gout_top_final$BP[i]),
Diff = case_when(Diff_Start < Diff_End ~ Diff_Start,
TRUE ~ Diff_End)) %>%
filter(Chrom == gout_top_final$CHR[i]) %>%
arrange(Diff) %>%
select(-Diff, -Diff_Start, -Diff_End) %>%
slice(1)
}
# Saving the final results to out.
out <- rbind(out, tmp)
}
# Adding the Nearest_Gene column based on the above results.
gout_top_final2 <- gout_top_final %>%
mutate(Locus_Name = out$Gene)
# Saving this list as UKBB_Gene_OR.
UKBB_Gene_OR <- gout_top_final2
# Saving UKBB_Gene_OR as an RData object.
save(UKBB_Gene_OR, file = here("Output/UKBB_Gene_OR.RData"))
# Cleaning up.
rm(list = ls()[str_detect(ls(), "_locus$|^chr|^final_|^gout_top|^rs")], bigOR, first_round, genelist, loci, out, regions, second_round, smallOR, tmp, file_names, i, j, sumstat_signif, sumstat_final)
Locus-Zooms
All of the locus zooms are plotted below in separate tabs:
# Reading in file paths of each locus zoom that has been created so far.
file_names <- list.files(here("Output/Plots"), full.names = T)[str_detect(list.files(here("Output/Plots"), full.names = T), "Chr")]
# Extracting information from file paths/names and ensuring they are in the correct order based on chr/bp while keeping their relative position in the file_names object.
tmp <- file_names %>%
as_tibble() %>%
separate(value, sep = "_", into = c(NA, "X2", "BP1", NA, NA, "Cond", "CondSNPs"), convert = TRUE) %>%
rowid_to_column() %>%
mutate(Cond = Cond == "condition",
CondSNPs = str_remove(CondSNPs, ".jpg")) %>%
separate(X2, sep = "/", into = c(NA, NA, NA, NA, NA, NA, NA, "CHR")) %>%
mutate(CHR = as.numeric(str_remove(CHR, "Chr"))) %>%
arrange(CHR, BP1, Cond, CondSNPs)
# Loading in the list of lead variants for the gout GWAS.
load(here("Output/UKBB_Gene_OR.RData"))
# Extracting loci with more than one locus name.
tmp1 <- UKBB_Gene_OR %>%
select(BP1, Locus_Name) %>%
unique() %>%
filter(duplicated(BP1) | duplicated(BP1, fromLast = TRUE))
# Collapsing these locus names into a single name for each locus.
# Making empty tibble called out.
out <- tibble()
# Looping over each locus with more than one locus name.
for(i in unique(tmp1$BP1)){
# Extracting the names for that locus.
names <- tmp1 %>%
filter(BP1 == i) %>%
pull(Locus_Name)
# Concatenating the names together separated by a | symbol.
name <- paste(names, collapse = "|")
# Writing out the locus position and new name.
out <- rbind(out, list(i, name))
}
# Setting the column names of the out object.
colnames(out) <- c("BP1", "Locus_Name")
# Combining the above with the remaining locus position/name combinations.
tmp2 <- UKBB_Gene_OR %>%
select(BP1, Locus_Name) %>%
unique() %>%
filter(!BP1 %in% tmp1$BP1) %>%
rbind(out)
# Adding the locus name column to the locus zoom information table.
tmp3 <- tmp %>%
left_join(tmp2, by = "BP1")
# Finalizing the locus zoom information table prior to plotting.
tmp4 <- tmp3 %>%
separate(CondSNPs, sep = "and", into = c("SNP1", "SNP2", "SNP3")) %>%
mutate(SNPs = case_when(is.na(SNP2) ~ SNP1,
!is.na(SNP2) & is.na(SNP3) ~ str_c(SNP1, SNP2, sep = " and "),
!is.na(SNP3) ~ str_c(SNP1, SNP2, SNP3, sep = " and ")),
Plot_Name = case_when(!Cond ~ paste0(Locus_Name, " (Uncond.)"),
Cond ~ paste0(Locus_Name, " (Cond. on ", SNPs, ")")))
# Reordering the file based on the position that we want them to be output (chr/bp increasing order).
file_names2 <- file_names[tmp$rowid]
# Setting the names of each file path to the name of the plot.
names(file_names2) <- tmp4$Plot_Name
# Creating text template for interpretation by Rmarkdown.
template <- c(
"#### {{nm}}\n",
"```{r, echo = FALSE}\n",
"include_graphics(file_names2['{{nm}}'])\n",
"```\n",
"\n"
)
# Making list of plots for displaying in the output HTML file.
plots <- lapply(
tmp4$Plot_Name,
function(nm) knit_expand(text = template)
)
PDZK1 (Uncond.)

PDZK1 (Cond. on rs10910845)

TRIM46 (Uncond.)

TRIM46 (Cond. on rs11264341)

GCKR (Uncond.)

GCKR (Cond. on rs1260326)

SFMBT1 (Uncond.)

SFMBT1 (Cond. on rs9847710)

SLC2A9|WDR1 (Uncond.)

SLC2A9|WDR1 (Cond. on rs7675964)

SLC2A9|WDR1 (Cond. on rs7675964 and rs6811287)

SLC2A9|WDR1 (Cond. on rs7675964 and rs6811287 and rs4481233)

ABCG2 (Uncond.)

ABCG2 (Cond. on rs2231142)

ABCG2 (Cond. on rs2231142 and rs10011796)

SLC17A1 (Uncond.)

SLC17A1 (Cond. on rs1165196)

ZSCAN31 (Uncond.)

ZSCAN31 (Cond. on rs853685)

MLXIPL (Uncond.)

MLXIPL (Cond. on rs3812316)

SLC16A9 (Uncond.)

SLC16A9 (Cond. on rs1171616)

SLC22A11|NRXN2 (Uncond.)

SLC22A11|NRXN2 (Cond. on rs17300741)

SLC22A11|NRXN2 (Cond. on rs17300741 and rs7937990)

RNASEH2C (Uncond.)

RNASEH2C (Cond. on rs4014195)

R3HDM2 (Uncond.)

R3HDM2 (Cond. on rs1106766)

MLXIP (Uncond.)

MLXIP (Cond. on rs28652632)

PNPLA3 (Uncond.)

PNPLA3 (Cond. on rs738409)

Summary of UK Biobank Gout GWAS analysis
In summary, I produced a list of 19 variants that collectively capture the independent genome-wide significant associations from the UK Biobank Gout GWAS summary statistics. These variants will be used to create a PRS for gout. Importantly, these variants were all genotyped on the either the Illumina CoreExome 24 v1.0, v1.1, or v1.3 array, or the OmniExome 8 v1.3 chip. This ensures that no genotype imputation was done on Polynesian individuals, which could bias results given the lack of a good reference panel for these populations. Three of the 15 total loci had more than one conditionally independent genome-wide significant signal. These were at or near SLC2A9 (3 hits), ABCG2 (2 hits), and SLC22A11 (2 hits).
Analysis of European Urate GWAS results from Tin et al. (2019)
Based on comments from reviewers, it was requested that I also produce a PRS based on the results of Tin et al., 2019. The following code details the following:
CoreExome genotyped PLINK files were filtered to exclude variants with over 10% missingness (up from 5% for the gout PRS) and those with MAF less than 0.01 in the entire CoreExome cohort. This was due to limitations of using a PRS with more loci, which results in a larger amount of individuals missing the PRS due to a higher chance of any one of the constituent variants being missing.
The locations of these filtered SNPs were extracted from the CoreExome bim file and this was filtered to only include SNPs that were also in the UK Biobank imputed genotype list.
Finally, the Tin et al. European summary statistics were filtered to only keep SNPs matching the chromosome and location of the above SNPs.
# Setting directory as a variable.
PRS_SCRATCH=/Volumes/scratch/merrimanlab/Nick/PRS
# Filtering CoreExome genotype file to only include variants with < 10% missingness and >1% frequency.
plink1.9b6.10 --bfile /Volumes/archive/merrimanlab/raid_backup/New_Zealand_Chip_data/CoreExome/QC_MergedBatches/Final_Data/CZ-MB1.2-QC1.10_CoreExome24-1.0-3_genotyped-QCd_rsIDconverted --make-bed --geno 0.1 --maf 0.01 --out $PRS_SCRATCH/Output/Temp/filtered_coreex
# Extracting the chromosome and position of each SNP in the filtered PLINK file, then changing tabs to spaces.
cut -f1,4 $PRS_SCRATCH/Output/Temp/filtered_coreex.bim | tr '\t' ' ' > $PRS_SCRATCH/Output/Temp/coreex_snps.txt
# For each chromosome, making a list of variant locations.
parallel "grep -w {1} {2}/Output/Temp/coreex_snps.txt | cut -d ' ' -f2 > {2}/Output/Temp/snplist_chr{1}.txt" ::: {1..22} ::: $PRS_SCRATCH
# Filtering for SNPs in the UK Biobank imputed genotype list.
parallel "cut -f3 /Volumes/scratch/merrimanlab/ukbio/EGAD00010001474/splits/ukb_mfi_chr{1}_v3.txt | grep -Fwf {2}/Output/Temp/snplist_chr{1}.txt | sed 's/^/{1} /' > {2}/Output/Temp/in_ukb_chr{1}.txt" ::: {1..22} ::: $PRS_SCRATCH
# Concatenating all the output txt files together.
cat $(ls -d $PRS_SCRATCH/Output/Temp/* | grep in_ukb) > $PRS_SCRATCH/Output/Temp/snplist.txt
# Further filtering out variants with less than 1% MAF then making a clean output file for reading into R.
cat <(head -n1 /Volumes/archive/merrimanlab/central_datasets/summary_gwas/urate/Tin_2019/cleaned/urate_chr1_22_LQ_IQ06_mac10_EA_60_rsid.txt) <(awk '$6 < 0.99 && $6 > 0.01' FS=' ' /Volumes/archive/merrimanlab/central_datasets/summary_gwas/urate/Tin_2019/cleaned/urate_chr1_22_LQ_IQ06_mac10_EA_60_rsid.txt | grep -Ff $PRS_SCRATCH/Output/Temp/snplist.txt) | sed 's/ /\t/g' > $PRS_SCRATCH/Output/Temp/tin_filtered.txt
The filtered Tin summary statistics were read back into R, then further cleaned up to ensure no rare variants remained
The summary statistics were used to define a list of top SNPs in a similar manner to the previous code for the UK Biobank gout GWAS
No conditional analyses were run on this list of summary statistics
In total, 82 common variants were defined as being associated with serum urate in Europeans, all of which were suitable for downstream analysis
# Reading the filtered summary stats back into R.
tin <- vroom(path(scratch_path, "Output/Temp/tin_filtered.txt"),
delim = "\t",
col_names = T)
# Filtering to keep only variants with an RSID.
test <- tin %>%
filter(str_detect(RSID, regex("^rs[0-9]+")))
# Adding SNP ID for variants without an RSID.
test2 <- tin %>%
filter(!(RSID %in% test$RSID)) %>%
mutate(RSID = paste0(Chr, Pos_b37, Allele1, Allele2))
# Joining the two tables back together and cleaning them up.
tin2 <- rbind(test, test2) %>%
arrange(Chr, Pos_b37) %>%
rename(CHR = Chr,
BP = Pos_b37,
P = `P-value`,
Beta = Effect,
SE = StdErr,
EAF = Freq1,
Effect_Allele = Allele1,
Alternate_Allele = Allele2) %>%
mutate(t = abs(Beta/SE))
# Checking the number of unique chr_bp combinations.
nrow(tin2 %>% select(CHR, BP) %>% unique()) # all unique
# Filtering to keep the significant variants that have MAF < 1% and ordering by t-value (as many P-values are 0).
sumstat_signif <- tin2 %>%
filter(P <= 5e-8,
EAF > 0.01,
EAF < 0.99) %>%
arrange(desc(t))
# Grouping into loci +- 500 kb of top SNPs.
# Extracting 1st row (most significant variant)
urate_top <- sumstat_signif %>%
slice(1)
# Filtering the summary stats to only keep variants outside of a +-500kb window from the top variant.
urate2 <- sumstat_signif %>%
filter(!(CHR == urate_top$CHR[1] & BP %in% ((urate_top$BP[1] - 500000):(urate_top$BP[1] + 500000))))
# Repeating this process until there are no remaining significant variants.
while(nrow(urate2) > 0) {
tmp <- urate2 %>%
slice(1)
urate_top <- rbind(tmp, urate_top)
urate2 <- urate2 %>%
filter(!(CHR == urate_top$CHR[1] & BP %in% ((urate_top$BP[1] - 500000):(urate_top$BP[1] + 500000))))
}
# Organizing the list by location.
urate_top <- urate_top %>%
arrange(CHR, BP)
# Finding regions of loci.
# Arranging significant summary stats by location.
sumstat_signif <- sumstat_signif %>%
arrange(CHR, BP)
# Setting out variable to NA.
out <- NA
# For each variant from the second to the end of the summary stats, comparing it with the previous variant and recording the difference in location if it is on the same chromosome.
for(i in 2:nrow(sumstat_signif)) {
if(sumstat_signif$CHR[i] == sumstat_signif$CHR[i - 1]){
out[i] <- sumstat_signif$BP[i] - sumstat_signif$BP[i - 1]
} else {
out[i] <- NA
}
}
# Adding difference column to summary stats and highlighting those variants within 500kb of each other.
tmp <- sumstat_signif %>%
mutate(Diff = out,
Diff2 = case_when(Diff < 500000 ~ Diff))
# Extracting the first significant variant in the summary stats (based on chr/bp).
out <- sumstat_signif %>% slice(1)
# For all remaining column, test if the Diff2 column is missing, if it is missing then take that variant and the previous variant for adding to the output. If it isn't missing then don't add it to the output. Variants missing this column are more than 500kb away from the previous variant.
for(i in 2:nrow(sumstat_signif)) {
if(is.na(tmp$Diff2[i])){
out <- rbind(out, sumstat_signif %>% slice(i - 1), sumstat_signif %>% slice(i))
}
}
# Add the final row to the output to complete the list of starts and ends of loci.
out <- rbind(out, sumstat_signif %>% slice(nrow(sumstat_signif)))
# Extracting regions of these loci for BGEN filtering of UK Biobank genotypes.
# Extracting the chromosome and position columns.
bgen_ranges <- out %>% select(CHR, BP)
# Taking every odd entry and relabel the BP to BP1.
tmp1 <- bgen_ranges %>% slice(seq(1, nrow(bgen_ranges), by = 2)) %>% rename(BP1 = BP)
# Taking every even entry and relabel the BP to BP2.
tmp2 <- bgen_ranges %>% slice(seq(2, nrow(bgen_ranges), by = 2)) %>% rename(CHR.x = CHR, BP2 = BP)
# Concatenating the two variant lists together and add a 50 kb buffer to each side of each locus.
bgen_ranges <- tmp1 %>%
cbind(tmp2) %>%
mutate(BP1 = BP1 - 50000,
BP2 = BP2 + 50000) %>%
select(-CHR.x)
# Making an empty vector called out.
out <- c()
# For all rows in bgen_ranges (i.e. all loci).
for(i in 1:nrow(bgen_ranges)){
# Extracting the most significant variant within that window.
tmp <- urate_top %>%
filter(CHR == bgen_ranges$CHR[i] & between(BP, bgen_ranges$BP1[i], bgen_ranges$BP2[i])) %>%
arrange(P) %>%
slice(1)
# Adding this variant to the out table.
out <- rbind(out, tmp)
}
# Concatenating the locus boundaries onto this output table.
urate_top <- out %>%
cbind(bgen_ranges %>% select(-CHR))
# Cleaning up.
rm(tmp, bgen_ranges, out, i)
# Adding confidence limits based on the effect +- 1.96 * the standard error.
urate_top <- urate_top %>%
mutate(L95 = Beta - 1.96 * SE,
U95 = Beta + 1.96 * SE)
# For all protective variants, flip the allele order and corresponding statistics so that the effect is also in the risk direction for urate.
smallOR <- urate_top %>%
filter(Beta < 0) %>%
mutate(Beta = as.numeric(signif(Beta * -1, digits = 4)),
tmp_L = as.numeric(signif(L95 * -1, digits = 4)),
tmp_U = as.numeric(signif(U95 * -1, digits = 4)),
U95 = tmp_L,
L95 = tmp_U,
EAF = 1 - EAF) %>%
rename(allele2 = Effect_Allele,
allele1 = Alternate_Allele) %>%
rename(Alternate_Allele = allele2,
Effect_Allele = allele1) %>%
select(CHR:BP, RSID, Effect_Allele, Alternate_Allele, Beta, L95, U95, SE:BP2)
# For all variants with a risk effect, clean up the table.
bigOR <- urate_top %>%
filter(Beta > 0) %>%
mutate(Beta = as.numeric(signif(Beta, digits = 4)),
L95 = as.numeric(signif(L95, digits = 4)),
U95 = as.numeric(signif(U95, digits = 4))) %>%
select(CHR:BP, RSID, Effect_Allele, Alternate_Allele, Beta, L95, U95, SE:BP2)
# Combine the two lists of variants together and clean up the table.
urate_top_final <- full_join(smallOR, bigOR) %>%
arrange(CHR, BP) %>%
select(-t, -n_total_sum) %>%
mutate(Effect_Allele = toupper(Effect_Allele),
Alternate_Allele = toupper(Alternate_Allele))
# Labeling each locus based on the nearest gene in the UCSC gene list.
# Preparing the gene list (removing rRNA genes).
genelist <- UCSC_GRCh37_Genes_UniqueList.txt %>%
select(Chrom, Start, End, Gene) %>%
filter(!str_detect(Gene, "rRNA"))
# Making empty vector.
out <- c()
# Looping over each lead variant.
for(i in 1:nrow(urate_top_final)){
# Making temporary table filtered to keep only genes that directly overlap with the variant.
tmp <- genelist %>%
filter(Chrom == urate_top_final$CHR[i],
Start < urate_top_final$BP[i],
End > urate_top_final$BP[i])
# Making merged gene name if the variant overlaps more than one gene.
if(nrow(tmp) > 1){
tmp <- tmp %>%
mutate(Gene = paste(Gene, sep = "|")) %>%
slice(1)
}
# If the variant doesn't overlap a gene, extracting the nearest gene based on start/end coordinates.
if(nrow(tmp) == 0){
tmp <- genelist %>%
mutate(Diff_Start = abs(Start - urate_top_final$BP[i]),
Diff_End = abs(End - urate_top_final$BP[i]),
Diff = case_when(Diff_Start < Diff_End ~ Diff_Start,
TRUE ~ Diff_End)) %>%
filter(Chrom == urate_top_final$CHR[i]) %>%
arrange(Diff) %>%
select(-Diff, -Diff_Start, -Diff_End) %>%
slice(1)
}
# Saving the final results to out.
out <- rbind(out, tmp)
}
# Adding the Nearest_Gene column based on the above results.
urate_top_final2 <- urate_top_final %>%
mutate(Locus_Name = out$Gene)
# Saving the result as Tin_Gene_OR.
Tin_Gene_OR <- urate_top_final2
# Saving the Tin_Gene_OR object.
save(Tin_Gene_OR, file = here("Output/Tin_Gene_OR.RData"))
# Cleaning up.
rm(list = ls()[str_detect(ls(), "^chr|^urate_top|^final_|gwas$")], bigOR, smallOR, tmp1, tmp2, sumstat_signif, urate2, test, test2, tin, tin2, i, file_names, file_names2, name, names, template, genelist, out, tmp, tmp3, tmp4, plots)
Preparing Phenotype files
The purpose of this section is to generate cleaned up phenotype files for each cohort, with the polygenic risk score (PRS) included for each. It contains the code for going from the raw phenotype and genotype data (in combination with the SNP lists generated in the previous section) to the finalized data frames for analysis.
The phenotypes of interest are the following (note some may be poorly phenotyped):
Self-reported gout status (i.e. gout vs control).
Self-reported age at collection.
Genetically determined sex.
Genetic principal components (all 10 global PCs and all 10 Oceanian PCs for Polynesians).
Self-reported age at gout onset.
- Disease duration derived from this and age at collection.
Self-reported tophaceous disease.
Self-reported flare frequency (number of flares in the last year).
Serum urate at collection.
Self-reported urate lowering therapy data (at collection).
Self-reported gout prophylaxis data (at collection).
Genetic ancestry data (i.e. European vs West Polynesian vs East Polynesian).
Comorbidity data, including hypertension, diabetes, heart disease (angina, myocardial infarction, or heart failure), kidney disease (serum creatinine/eGFR), dyslipidemia, stroke - including self report, medication, and metrics such as BMI (for descriptive stats table).
Lifestyle factors - total alcohol consumption, sugar-sweetened drink consumption, smoking status (for descriptive stats table).
Self-reported family history of gout.
Exclusion criteria:
- Genetic sex to self-report gender mismatch.
Initially, the list of individuals for studying was derived from a single phenotype file that had been prepared previously by Tanya. These cohorts were divided into European and Polynesian samples, then further subdivided into individual European cohorts, and East and West Polynesian cohorts. Below is a summary of which cohorts contributed to which combined cohorts, along with basic statistics for total cohort size and proportion of gout cases. Importantly, all statistics reported below are prior to exclusion for missing data.
Note that there were 232 other Ardea study participants (labeled 594 and 3170) that were not part of any of the main studies. They don’t have any of the phenotypes of interest so there was no point including them in the study.
Overall, the European cohorts of interest include 5,055 gout cases, along with 1,576 controls. These cohorts are comprised as follows:
The pooled Australian/New Zealand European cohort, comprising 2,741 total individuals (1,299 gout, 1,442 control) from the following cohorts:
- Gout in Aotearoa.
- AGRIA.
- Diabetes Mellitus.
- Renal Disease.
- Ngāti Porou Hauora trust.
- LPA.
The EuroGout cohort, including EireGout, a total of 2,053 individuals (1,919 gout, 134 controls (controls excluded from further analyses)).
The five Ardea Biosciences (Ardea) cohorts recruited from the following clinical trials (each was analyzed separately, with a total of 1,837 individuals (all gout) across all 5 cohorts):
- LASSO (908 gout).
- CLEAR 1 (293 gout) - People who are poor responders to allopurinol (everyone was on ULT at screening yet had > 6mg/dL).
- CLEAR 2 (310 gout) - People who are poor responders to allopurinol (everyone was on ULT at screening yet had > 6mg/dL).
- CRYSTAL (194 gout) - Two groups, 1 = same as CLEAR trials but also had >= 1 tophus, 2 = very HU people not on ULT with at least one tophus.
- LIGHT (132 gout) - People who cannot take allopurinol, some may be on other ULT at screening.
For Polynesian cohorts, three total cohorts were produced, excluding individuals of mixed East/West Polynesian ancestry, resulting in 1,380 gout cases, along with 1,269 controls:
The pooled West Polynesian gout cohort, comprising 906 West Polynesian individuals (492 gout, 414 control) from the following studies:
- AGRIA
- Ardea - LASSO
- Ardea - CLEAR 1 - People who are poor responders to allopurinol (everyone was on ULT at screening yet had > 6mg/dL).
- Ardea - CLEAR 2 - People who are poor responders to allopurinol (everyone was on ULT at screening yet had > 6mg/dL).
- Ardea - CRYSTAL - Two groups, 1 = same as CLEAR trials but also had >= 1 tophus, 2 = very HU people not on ULT with at least one tophus.
- Diabetes Mellitus
- Gout in Aotearoa
- LPA
- Ngāti Porou Hauora trust
- Renal Disease
The pooled East Polynesian gout cohort, comprising 1,422 East Polynesian individuals (666 gout, 756 control) from the following studies:
- AGRIA
- Ardea - LASSO
- Ardea - CLEAR 1 - People who are poor responders to allopurinol (everyone was on ULT at screening yet had > 6mg/dL).
- Ardea - CLEAR 2 - People who are poor responders to allopurinol (everyone was on ULT at screening yet had > 6mg/dL).
- Ardea - CRYSTAL - Two groups, 1 = same as CLEAR trials but also had >= 1 tophus, 2 = very HU people not on ULT with at least one tophus.
- Diabetes Mellitus
- Gout in Aotearoa
- LPA
- Renal Disease
The Ngāti Porou Hauora trust East Polynesian cohort, comprising 283 East Polynesian individuals (176 gout, 107 control) from the Ngāti Porou Hauora trust cohort.
# Making phenotype files.
# Loading CoreExome QC 1-10 phenotype file into R (this was made by Tanya).
CoreExPheno <- read_delim("/Volumes/archive/merrimanlab/raid_backup/New_Zealand_Chip_data/CoreExome/QC_MergedBatches/Final_Data/CZ-MB1.2-QC1.10_MergedPhenotypes_20082020.txt", delim = "\t") %>%
mutate(across(where(is_character), factor))
# Loading IDs of individuals genotyped on the CoreExome chip.
All_CoreEx_ID <- read_delim("/Volumes/archive/merrimanlab/raid_backup/New_Zealand_Chip_data/CoreExome/QC_MergedBatches/Final_Data/CZ-MB1.2-QC1.10_CoreExome24-1.0-3_genotyped-QCd_rsIDconverted.fam", delim = " ", col_names = F)
# Extracting information on the genotyped individuals from European cohorts of interest.
CoreExPheno_Euro <- CoreExPheno %>%
filter(Geno.BroadAncestry == "European",
Geno.SampleID %in% All_CoreEx_ID$X2,
General.Use != "No",
!(Pheno.Study %in% c("Auckland Controls", "Australian Controls", "ESR", "Rheumatoid Arthritis")))
# Extracting information on the genotyped individuals from Polynesian cohorts of interest.
CoreExPheno_Poly <- CoreExPheno %>%
filter(Geno.BroadAncestry == "Oceanian",
Geno.SampleID %in% All_CoreEx_ID$X2,
General.Use != "No",
!(Pheno.Study %in% c("ESR", "Pacific Trust")))
# Combining the above European and Polynesian cohort information together into one phenotype file, excluding individuals with unknown genetic sex, and excluding those without gout phenotype information. Columns of interest were extracted at this step.
CoreExPheno_Final <- full_join(CoreExPheno_Euro, CoreExPheno_Poly) %>%
filter(Geno.GeneticSex != "Unknown",
!is.na(Pheno.GoutSummary)) %>%
mutate(Pheno.GoutSummary = factor(case_when(Pheno.GoutSummary == "Gout" ~ "Gout",
Pheno.GoutSummary %in% c("Control", "HyperU") ~ "Control")),
across(where(is.factor), factor)) %>%
select(Pheno.SampleID:Pheno.UrateTherapy, GenStudio.ChipType, GenStudio.CallRate:Notes)
# Cleaning up.
rm(CoreExPheno, CoreExPheno_Euro, CoreExPheno_Poly, All_CoreEx_ID)
# Investigating basic statistics on each cohort to be analyzed.
aus_nz <- CoreExPheno_Final %>%
filter(Geno.BroadAncestry == "European",
Pheno.Study %in% c("Gout in Aotearoa", "AGRIA", "Diabetes Mellitus", "Renal Disease", "Ngati Porou", "LPA"))
eurogout <- CoreExPheno_Final %>%
filter(Geno.BroadAncestry == "European",
Pheno.Study %in% c("EuroGout", "EireGout"))
ardea <- CoreExPheno_Final %>%
filter(Geno.BroadAncestry == "European",
str_detect(Pheno.Study, "Ardea"),
!str_detect(Pheno.Study, "594|3170"))
westpoly <- CoreExPheno_Final %>%
filter(Geno.PolynesianGroup == "West Polynesian",
!str_detect(Pheno.Study, "594|3170"))
eastpoly <- CoreExPheno_Final %>%
filter(Geno.PolynesianGroup == "East Polynesian",
Pheno.Study != "Ngati Porou",
!str_detect(Pheno.Study, "594|3170"))
eastpoly_np <- CoreExPheno_Final %>%
filter(Geno.PolynesianGroup == "East Polynesian",
Pheno.Study == "Ngati Porou")
# Cleaning up.
rm(aus_nz, eurogout, ardea, eastpoly, westpoly, eastpoly_np)
Below is the code for preparing the phenotype files based on the heterogeneous phenotype files downloaded from BC SNPmax. First I will clean up phenotype information for all cohorts that contribute to the Australian/New Zealand Aotearoa cohort (including Gout in Aotearoa, AGRIA, DM, LPA, Ngati Porou, and RD). Next I will clean up the EuroGout cohort information. Finally, I will clean up the Ardea cohort phenotype files, split into Ironwood (CLEAR 1, CLEAR 2, CRYSTAL, and LIGHT), and LASSO. Throughout, disease duration will be derived from age at onset - age at recruitment + 1 (because it is derived from age in years, there is up to 1 additional year duration that needs to be accounted for).
# Making functions for converting Boolean variables into TRUE/FALSE from 2/1 or 1/0.
logicfactor <- function(x) {
as.logical(factor(x, levels = c(1, 2), labels = c("FALSE", "TRUE")))
}
logicfactor2 <- function(x) {
as.logical(factor(x, levels = c(0, 1), labels = c("FALSE", "TRUE")))
}
# Gout in Aotearoa.
# Making temporary phenotype file for the Gout in Aotearoa cohort based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study == "Gout in Aotearoa")
# Extracting IDs from the SNPmax Gout in Aotearoa phenotype file based on the CoreExome file. Selecting columns of interest. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
aotearoa_pheno <- read_delim(here("Data/Phenotypes/NZPheno.txt"), delim = "\t", guess_max = 5000) %>%
filter(SUBJECT %in% tmp$Pheno.SampleID) %>%
select(SUBJECT, DATEARR, DOB, AGECOL, DIABETES, FAMGOUT, FAMGOUT3:HIBP, HIBPTREAT:FRUSEMIDE, BUMETANIDE, THIAZIDEDIURETIC:BENDROFLUAZIDE, HCTZ, METOLAZONE, CHLORHALIDONE, INDAPAMIDE, OTHDIURETIC, SPIRONOLACTONE, AMILORIDE, ACETAZOLAMIDE, DIURETICCOMMENT:DIURRECRUIT, LIPIDS, LIPIDLOWER:BILEACIDSEQ, HEART:STROKE, KIDNEY:HEALTHOTH, SUGDRINK, SMOKER:OTHALCO, WEIGHT:HEIGHT, BMI:BMICALC, MRURATE:MRCREATDATE, GOUTCRITERIAB, SUSTOPHUS:DIURGOUT, ALLOPCURRENT, PROBENCURRENT, BENZBROCURRENT, FEBUXCURRENT, OTHULTCURRENT, CURULTCOMMENT:ALLOPINTOLERANCE, ALLOPSIDE, URATEDOX:HIGHESTSUDATE, CHOLES:TRIGLY, SCREAT:SURICACID, URATE1MONTH, RELATEDFILTER:RELATED) %>%
left_join(tmp, by = c("SUBJECT" = "Pheno.SampleID")) %>%
rename(IID = SUBJECT,
GOUT = Pheno.GoutSummary,
SEX = Geno.GeneticSex) %>%
mutate(across(where(is_character), factor),
across(c(FAMGOUT, FAMGOUT3, HIBP, DIURETIC:ACETAZOLAMIDE, LIPIDS:KIDNEY, FATTYLIVER, GOUTCRITERIAB:SUSTOPHUS, TOPHUS, ALLOPCURRENT:OTHULTCURRENT, ALLOPINTOLERANCE),
logicfactor),
AGE1ATK = case_when(!is.na(AGE1ATK) ~ AGE1ATK,
TRUE ~ AGECOL - DURATION),
DURATION = AGECOL - AGE1ATK + 1,
NUMATK = NUMATK,
TOPHIGOUT = GOUTCRITERIAB | SUSTOPHUS | TOPHUS,
EROSIONS = NA,
URATE = case_when(is.na(SURICACID) ~ rowMeans(across(c(MRURATE, URATEDOX, PREULTURATE, HIGHESTSU, URATE1MONTH)), na.rm = TRUE) * 1000 / 59.48,
TRUE ~ SURICACID * 1000 / 59.48),
ULT = ALLOPCURRENT | PROBENCURRENT | BENZBROCURRENT | FEBUXCURRENT | OTHULTCURRENT,
PROPHY = NA,
HEIGHT = HEIGHT / 100,
BMI = case_when(!is.na(BMI) ~ BMI,
TRUE ~ WEIGHT / (HEIGHT * HEIGHT)),
HYPERTENSION = HIBP | !is.na(HIBPTREAT) | DIURETIC | DIURETICCURRENT | LOOPDIURETIC | FRUSEMIDE | BUMETANIDE | THIAZIDEDIURETIC | BENDROFLUAZIDE | HCTZ | METOLAZONE | CHLORHALIDONE | INDAPAMIDE | OTHDIURETIC | SPIRONOLACTONE | AMILORIDE | ACETAZOLAMIDE | !is.na(DIURETICCOMMENT) | DIURRECRUIT == 2 | DIURGOUT %in% 2:4,
DIABETES = DIABETES == 2,
HEART = HEART | ANGINA | HEARTFAILURE | HEARTSURGERY | HEARTATTACK,
CREAT = rowMeans(across(c(SCREAT, MRCREAT))) / 88.42,
EGFR = case_when(SEX == "Male" ~ 175 * (CREAT ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (CREAT ^ -1.154) * (AGECOL ^ -0.203) * 0.742),
KIDNEY = KIDNEY | !is.na(KIDNEY2) | EGFR < 60,
LIPIDS = LIPIDS | LIPIDLOWER | STATIN | FIBRATES | EZETIMIBE | NICOTINICACID | BILEACIDSEQ,
STROKE = STROKE,
TOTALALC = rowSums(across(c(BEER, WINE, SPIRITS)), na.rm = TRUE),
SUGDRINK = SUGDRINK,
CURSMOKE = SMOKER == 2,
FAMGOUT = FAMGOUT,
FAMGOUTNUM = FAMGOUT4) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# AGRIA.
# Making temporary phenotype file for the AGRIA cohort based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study == "AGRIA")
# Extracting IDs from the SNPmax AGRIA phenotype file based on the CoreExome file. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
agria_pheno <- read_delim(here("Data/Phenotypes/AGRIAPheno.txt"), delim = "\t") %>%
filter(PATIENT %in% tmp$Pheno.SampleID) %>%
left_join(tmp, by = c("PATIENT" = "Pheno.SampleID")) %>%
rename(IID = PATIENT,
GOUT = Pheno.GoutSummary) %>%
mutate(across(where(is_character), factor),
across(c(DIABETES:KIDNEY, ALLOP, PROBEN, STEROID, ANTIINFLAM, COLCHI, GPGOUT:SUSTOPHUS, FAMGOUT, FAMGOUT3, FOOD, FULLCAUGTAFF:TOPHIGOUT, URATELOWERING),
logicfactor),
TOPHUS = case_when(TOPHUS == 2 ~ TRUE,
TOPHUS == 1 ~ FALSE),
SSBCODE = factor(SSBCODE,
levels = 0:5,
labels = c("0/day", "0.1 - 0.99", "1.0 - 1.99", "2.0 - 2.99", "3.0 - 3.99", "4.0 +")),
FRUITCODE = factor(FRUITCODE,
levels = 0:5,
labels = c("0/day", "0.1 - 0.99", "1.0 - 1.99", "2.0 - 2.99", "3.0 - 3.99", "4.0 +")),
DIURETICSUMMARY = factor(DIURETICSUMMARY,
levels = 1:3,
labels = c("Not taking diuretics", "Taking diuretics", "Maybe taking diuretics")),
SEX = Geno.GeneticSex,
AGESERUM = round(as.duration(interval(DOB, SERUMDATE)) / as.duration(years(1)),
digits = 0),
AGESCL = round(as.duration(interval(DOB, SCLDATE)) / as.duration(years(1)),
digits = 0),
AGE1ATK = case_when(is.na(AGEGOUTDOX) ~ round(as.duration(interval(DOB, GOUTDOXDATE)) / as.duration(years(1)),
digits = 0),
TRUE ~ AGEGOUTDOX),
DURATION = AGECOL - AGE1ATK + 1,
TOPHIGOUT = case_when(COMMENT %in% c("No information, neither tophaceous or aspirate proven, Deceased",
"No information, neither tophaceous or aspirate proven",
"Gout, no tophi",
"No information, neither tophaceous or aspirate proven, lymphoma") ~ FALSE,
TRUE ~ TOPHUS | GOUTCRITERIAB | SUSTOPHUS | COMMENT %in% c("Tophaceous",
"Urate crystals present, tophaceous",
"Aspirate proven, tophacous",
"allopurinol intolerant, febuxostat intolerant, taking benzobromarone. Urate crystals present, tophacious",
"Tophaceous gout",
"Polyarticular tophaceous gout",
"Chronic tophaceous gout")),
EROSIONS = NA,
NUMATK = NA,
URATE1 = round(URATE * 1000 / 59.48, digits = 1),
URATEAGE1 = AGESERUM,
URATE2 = round(SURICACID_SCL * 1000 / 59.48, digits = 1),
URATEAGE2 = AGESCL,
URATE = case_when(!is.na(URATE1) ~ URATE1,
TRUE ~ URATE2),
ULT = case_when(is.na(URATE1) & !is.na(URATE2) ~ NA,
TRUE ~ ALLOP | PROBEN | COMMENT %in% c("allopurinol intolerant, febuxostat intolerant, taking benzobromarone. Urate crystals present, tophacious",
"Allopurinol hypersensitivity, Cholchicine induced diarrhoea, Febuxostat 40mg/day",
"febuxostat 40mg/day; liver toxicity with allopurinol")),
PROPHY = STEROID | ANTIINFLAM | COLCHI,
HYPERTENSION = case_when(!is.na(HIBP) ~ HIBP,
TRUE ~ DIURETICINDUCED == "Yes" | DIURETICSUMMARY == "Maybe taking diuretics"),
TRIGLY = TRIGLY_SCL * 88.57,
CHOLES = CHOLES_SCL * 38.67,
STROKE = NA,
HDL = HDL_SCL * 38.67,
CREAT = CREAT / 88.42,
SCREAT = SCREAT / 88.42,
CREAT2 = rowMeans(across(c(CREAT, SCREAT)), na.rm = TRUE),
EGFR = case_when(SEX == "Male" ~ 175 * (CREAT2 ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (CREAT2 ^ -1.154) * (AGECOL ^ -0.203) * 0.742),
KIDNEY = KIDNEY | !is.na(KIDNEY2) | EGFR < 60,
TOTALALC = rowSums(across(c(BEER, WINE, SPIRITS)), na.rm = TRUE),
CURSMOKE = NA,
FAMGOUT = FAMGOUT | FAMGOUT3,
FAMGOUTNUM = as.numeric(FAMGOUT4)) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Diabetes Mellitus (DM).
# Making temporary phenotype file for the DM cohort based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study == "Diabetes Mellitus")
# Extracting IDs from the SNPmax DM phenotype file based on the CoreExome file. Selecting columns of interest. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
dm_pheno <- read_delim(here("Data/Phenotypes/DMPheno.txt"), delim = "\t") %>%
filter(PATIENT %in% tmp$Pheno.SampleID) %>%
select(PATIENT, DOB, DATECOL, AGECOL, DIABETES:DIABETESTREAT, FAMGOUT:HIBPTREAT, LIPIDS, HEART:STROKE, KIDNEY:KIDNEY2, SUGDRINK, SMOKER:OTHALCO, WEIGHT, HEIGHT, BMI, URATE:CREAT, DIURETIC:OTHDIURETIC, LIPIDLOWER:BILEACIDSEQ, COMMENT, GOUTCRITERIAB, SUSTOPHUS:OTHDRUG, URATEDOX:DATEDOX, DIABETESAFFSTAT, KIDNEYTRANSPLANT, RENALDISEASE, FASTING:TRIGLY, SURICACID:EGFR) %>%
left_join(tmp, by = c("PATIENT" = "Pheno.SampleID")) %>%
rename(IID = PATIENT,
GOUT = Pheno.GoutSummary,
SEX = Geno.GeneticSex) %>%
mutate(across(where(is_character), factor),
across(c(DIABETES, FAMGOUT, FAMGOUT3, HIBP, LIPIDS, HEART:STROKE, KIDNEY, DIURETIC:OTHDIURETIC, LIPIDLOWER:BILEACIDSEQ, GOUTCRITERIAB, SUSTOPHUS, TOPHUS, ALLOP:COLCHI, DIABETESAFFSTAT, KIDNEYTRANSPLANT, RENALDISEASE), logicfactor),
DURATION = AGECOL - AGE1ATK + 1,
TOPHIGOUT = TOPHUS | GOUTCRITERIAB | SUSTOPHUS,
EROSIONS = NA,
URATE = case_when(!is.na(SURICACID) ~ SURICACID * 1000 / 59.48,
!is.na(URATE) ~ URATE * 1000 / 59.48,
TRUE ~ URATEDOX * 1000 / 59.48),
ULT = ALLOP | PROBEN,
PROPHY = STEROID | ANTIINFLAM | COLCHI | OTHDRUG != "no",
HEIGHT = HEIGHT / 100,
BMI = case_when(!is.na(BMI) ~ BMI,
TRUE ~ WEIGHT / (HEIGHT * HEIGHT)),
HYPERTENSION = HIBP | !is.na(HIBPTREAT) | DIURETIC | LOOPDIURETIC | THIAZIDEDIURETIC | OTHDIURETIC | DIURGOUT,
DIABETES = DIABETES | !is.na(DIABETESTREAT) | DIABETESAFFSTAT,
HEART = HEART | ANGINA | HEARTFAILURE | HEARTSURGERY | HEARTATTACK,
CREAT = CREAT / 88.42,
SCREAT = SCREAT / 88.42,
CREAT2 = rowMeans(across(c(CREAT, SCREAT)), na.rm = T),
EGFR = case_when(SEX == "Male" ~ 175 * (CREAT2 ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (CREAT2 ^ -1.154) * (AGECOL ^ -0.203) * 0.742,
TRUE ~ EGFR),
KIDNEY = KIDNEY | !is.na(KIDNEY2) | EGFR < 60 | KIDNEYTRANSPLANT | RENALDISEASE,
LIPIDS = LIPIDS | LIPIDLOWER | STATIN | FIBRATES | EZETIMIBE | NICOTINICACID | BILEACIDSEQ,
STROKE = STROKE,
TOTALALC = rowSums(across(c(BEER, WINE, SPIRITS)), na.rm = TRUE),
SUGDRINK = SUGDRINK,
CURSMOKE = SMOKER == 2,
FAMGOUT = FAMGOUT | FAMGOUT3,
FAMGOUTNUM = FAMGOUT4) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# LPA.
# Making temporary phenotype file for the LPA cohort based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study == "LPA")
# Extracting IDs from the SNPmax LPA phenotype file based on the CoreExome file. Selecting columns of interest. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
lpa_pheno <- read_delim(here("Data/Phenotypes/LPAPheno.txt"), delim = "\t") %>%
filter(SUBJECT %in% tmp$Pheno.SampleID) %>%
select(SUBJECT:AGE, SMOKING, SMOKEHISTORY, SUGARDRINKS:DIABETESTYPE, MAINHYPERTENSION:DYSLIPIDCOMMENT, MAINSTROKE:MAINSTROKECOM, BMHEIGHT:BMWEIGHT, SERUMCREATININE:SERUMURATE, TOTALCHOLESTEROL, TRIGLYCERIDES) %>%
left_join(tmp, by = c("SUBJECT" = "Pheno.SampleID")) %>%
rename(IID = SUBJECT,
GOUT = Pheno.GoutSummary,
SEX = Geno.GeneticSex,
AGECOL = AGE) %>%
mutate(across(where(is_character), factor),
across(c(SMOKING:SMOKEHISTORY, MAINDIABETES, MAINHYPERTENSION, DYSLIPIDEMIA),
logicfactor),
AGE1ATK = NA,
DURATION = NA,
NUMATK = NA,
TOPHIGOUT = NA,
EROSIONS = NA,
URATE = SERUMURATE * 1000 / 59.48,
ULT = NA,
PROPHY = NA,
HEIGHT = BMHEIGHT / 100,
BMI = BMWEIGHT / (HEIGHT * HEIGHT),
HYPERTENSION = MAINHYPERTENSION,
DIABETES = DIABETESTYPE == 2,
HEART = NA,
CREAT = SERUMCREATININE / 88.42,
EGFR = case_when(SEX == "Male" ~ 175 * (SERUMCREATININE ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (SERUMCREATININE ^ -1.154) * (AGECOL ^ -0.203) * 0.742),
KIDNEY = EGFR < 60,
LIPIDS = DYSLIPIDEMIA,
STROKE = MAINSTROKE,
TOTALALC = NA,
SUGDRINK = SUGARDRINKS,
CURSMOKE = SMOKING,
FAMGOUT = NA,
FAMGOUTNUM = NA) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Ngati Porou (NPH).
# Making temporary phenotype file for the NPH cohort based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study == "Ngati Porou")
# Extracting IDs from the SNPmax NPH phenotype file based on the CoreExome file. Selecting columns of interest. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
nph_pheno <- read_delim(here("Data/Phenotypes/NPHPheno.txt"), delim = "\t") %>%
filter(PATIENT %in% tmp$Pheno.SampleID) %>%
select(PATIENT, DOB, CONSENT, DATEARR, AGECOL, DIABETES, FAMGOUT:HIBP, LIPIDS, LIPIDLOWER:STROKE, KIDNEY, SUGDRINK, SMOKER:SPIRITS, WEIGHT:HEIGHT, BMI, URATE:CREATDATE, DIURETICCURRENT:FRUSEMIDE, BUMETANIDE, BENDROFLUAZIDE, HCTZ, METOLAZONE, CHLORHALIDONE, SPIRONOLACTONE, AMILORIDE, COMMENT, GOUTCRITERIAB, SUSTOPHUS, AGE1ATK:ALLOP, STEROID:OTHDRUG, URATEDOX:DATEDOX, RENALTRANSPLANT, DIABETESAFFSTAT, SURICACID:SCREAT, DIURETIC:OTHDIURETIC, STATIN:BILEACIDSEQ, URATELOWERING) %>%
left_join(tmp, by = c("PATIENT" = "Pheno.SampleID")) %>%
rename(IID = PATIENT,
GOUT = Pheno.GoutSummary,
SEX = Geno.GeneticSex) %>%
mutate(across(where(is_character), factor),
across(c(DIABETES, FAMGOUT, FAMGOUT3, HIBP:STROKE, KIDNEY, DIURETICCURRENT:AMILORIDE, GOUTCRITERIAB, SUSTOPHUS, TOPHUS, ALLOP:BENZOBROMARONE, RENALTRANSPLANT, DIABETESAFFSTAT, DIURETIC:URATELOWERING),
logicfactor),
AGE1ATK = case_when(!is.na(AGE1ATK) ~ AGE1ATK,
TRUE ~ AGECOL - DURATION),
DURATION = AGECOL - AGE1ATK + 1,
NUMATK = NUMATK,
TOPHIGOUT = GOUTCRITERIAB | SUSTOPHUS | TOPHUS,
EROSIONS = NA,
URATE = case_when(is.na(SURICACID) ~ rowMeans(across(c(URATE, URATEDOX)), na.rm = TRUE) * 1000 / 59.48,
TRUE ~ SURICACID * 1000 / 59.48),
ULT = ALLOP | PROBEN | BENZOBROMARONE | URATELOWERING,
PROPHY = STEROID | ANTIINFLAM | COLCHI,
HEIGHT = HEIGHT / 100,
BMI = WEIGHT / (HEIGHT * HEIGHT),
HYPERTENSION = HIBP | DIURETICCURRENT | FRUSEMIDE | BUMETANIDE | BENDROFLUAZIDE | HCTZ | METOLAZONE | CHLORHALIDONE | SPIRONOLACTONE | AMILORIDE | DIURGOUT %in% 2:4 | DIURETIC | LOOPDIURETIC | THIAZIDEDIURETIC | OTHDIURETIC,
DIABETES = DIABETES | DIABETESAFFSTAT,
HEART = HEART | ANGINA | HEARTFAILURE | HEARTSURGERY | HEARTATTACK,
CREAT = rowMeans(across(c(CREAT, SCREAT)), na.rm = TRUE) / 88.42,
EGFR = case_when(SEX == "Male" ~ 175 * (CREAT ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (CREAT ^ -1.154) * (AGECOL ^ -0.203) * 0.742),
KIDNEY = KIDNEY | EGFR < 60 | RENALTRANSPLANT,
LIPIDS = LIPIDS | LIPIDLOWER | STATIN | FIBRATES | EZETIMIBE | NICOTINICACID | BILEACIDSEQ,
STROKE = STROKE,
TOTALALC = rowSums(across(c(BEER, WINE, SPIRITS)), na.rm = TRUE),
SUGDRINK = SUGDRINK,
CURSMOKE = SMOKER == 2,
FAMGOUT = FAMGOUT,
FAMGOUTNUM = FAMGOUT4) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Renal Disease (RD).
# Making temporary phenotype file for the RD cohort based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study == "Renal Disease")
# Extracting IDs from the SNPmax RD phenotype file based on the CoreExome file. Selecting columns of interest. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
rd_pheno <- read_delim(here("Data/Phenotypes/RDPheno.txt"), delim = "\t") %>%
filter(PATIENT %in% tmp$Pheno.SampleID) %>%
select(PATIENT, DOB, CONSENTDATE, DATECOL, DATEARR, CKDV, RENALTRANSPLANT, DIABETES, FAMGOUT, HYPERTENSION, DYSLIPIDAEMIA, IHD, CVA, CHF, HEALTHOTH:WEIGHT, BMI, SMOKER, SUGDRINK, BEER:SPIRITS, COMMENT, TYPE2D, GOUTCRITERIAB, SUSTOPHUS, AGE1ATK:OTHDRUG, ESSENTIALHYPERT, SURICACID:SCREAT, RCOMMENTS) %>%
left_join(tmp, by = c("PATIENT" = "Pheno.SampleID")) %>%
rename(IID = PATIENT,
GOUT = Pheno.GoutSummary,
SEX = Geno.GeneticSex) %>%
mutate(across(where(is_character), factor),
across(c(RENALTRANSPLANT:CHF, TYPE2D:SUSTOPHUS, TOPHUS, ALLOPURINOL:RASBURICASE),
logicfactor),
AGECOL = AGECOL,
AGE1ATK = AGE1ATK,
DURATION = AGECOL - AGE1ATK + 1,
NUMATK = NA,
TOPHIGOUT = GOUTCRITERIAB | SUSTOPHUS | TOPHUS,
EROSIONS = NA,
URATE = case_when(is.na(SURICACID) ~ rowMeans(across(c(URATEFIRSTREC, URATEDOX, URATERECENT)), na.rm = TRUE) * 1000 / 59.48,
TRUE ~ SURICACID * 1000 / 59.48),
ULT = ALLOPURINOL | PROBEN | RASBURICASE,
PROPHY = STEROID | ANTIINFLAM | COLCHI,
HEIGHT = HEIGHT / 100,
BMI = WEIGHT / (HEIGHT * HEIGHT),
HYPERTENSION = HYPERTENSION | ESSENTIALHYPERT == 1 | DIURGOUT %in% 2:4,
DIABETES = DIABETES | TYPE2D,
HEART = IHD | CHF,
EGFR = case_when(SEX == "Male" ~ 175 * (SCREAT / 88.42) ^ -1.154 * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (SCREAT / 88.42) ^ -1.154 * (AGECOL ^ -0.203) * 0.742),
KIDNEY = CKDV == 1 | RENALTRANSPLANT | EGFR < 60,
LIPIDS = DYSLIPIDAEMIA,
STROKE = CVA,
TOTALALC = rowSums(across(c(BEER, WINE, SPIRITS)), na.rm = TRUE),
SUGDRINK = SUGDRINK,
CURSMOKE = SMOKER == 2,
FAMGOUT = FAMGOUT,
FAMGOUTNUM = NA) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# EuroGout.
# Making temporary phenotype file for the EuroGout cohort based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study == "EuroGout")
# Extracting IDs from the SNPmax EuroGout phenotype file based on the CoreExome file. Selecting columns of interest. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
eurogout_pheno <- read_delim(here("Data/Phenotypes/EuroGoutPheno.txt"), delim = "\t", guess_max = 5000) %>%
filter(SUBJECT %in% tmp$Pheno.SampleID) %>%
select(SUBJECT, RECRUITMENTDATE, DOB:WEIGHT, HEIGHT, BMI, TOPHUS:GOUTNOTES, ACRB, ACRC8, RENALDISEASE, T2DIABETES:HEARTFAILURE, MEDICALCOMMENT, URATETHERAPY:ALLOPURINOL, CHOLCHICINE:TLDIURETICS, ASPRIN, SUGARDRINK:FRUITJUICE, ALCOHOL:PREUTLKURATE, TCHOLESTEROL:TRIGLYCERIDES, EGFR) %>%
left_join(tmp, by = c("SUBJECT" = "Pheno.SampleID")) %>%
rename(IID = SUBJECT,
GOUT = Pheno.GoutSummary,
AGECOL = AGERECRUITMENT,
SEX = Geno.GeneticSex) %>%
mutate(across(where(is_character), factor),
across(c(TOPHUS, EROSIONS, ACRB, ACRC8, RENALDISEASE, T2DIABETES, HYPERTENSION, DYSLIPIDEMIA, STROKE:HEARTFAILURE, ALLOPURINOL:ASPRIN),
logicfactor2),
AGE1ATK = case_when(!is.na(AGEFIRSTATTK) ~ AGEFIRSTATTK,
TRUE ~ AGECOL - DURATIONGOUT),
DURATION = AGECOL - AGE1ATK + 1,
NUMATK = case_when(!is.na(NUMATTACKS) ~ NUMATTACKS,
NUMATTACKS_TXT == ">5" ~ 5,
NUMATTACKS_TXT == "1" ~ 1,
NUMATTACKS_TXT == "2" ~ 2,
NUMATTACKS_TXT == "3" ~ 3,
NUMATTACKS_TXT %in% c("3 to 5", "3-5") ~ 4,
NUMATTACKS_TXT %in% c("reported 'continue' I think. I assume this means ongoing.", "reported 100.") ~ 52,
NUMATTACKS_TXT == "zehn" ~ 10),
TOPHIGOUT = TOPHUS | NUMTOPHI %in% 1:3 | ACRB | ACRC8,
URATE = case_when(is.na(SERUMURATE) ~ PREUTLKURATE * 1000 / 59.48,
TRUE ~ SERUMURATE * 1000 / 59.48),
ULT = GOUTNOTES == "Gout assumed, taking allopurinol" | (!is.na(URATETHERAPY) & !(URATETHERAPY %in% c("diet", "NIL", "no", "No uric acid lowering therapy", "none", "None", "NONE", "none listed", "Unclear"))) | ALLOPURINOL,
PROPHY = CHOLCHICINE | NSAIDS | ASPRIN,
HEIGHT = HEIGHT / 100,
BMI = case_when(!is.na(BMI) ~ BMI,
TRUE ~ WEIGHT / (HEIGHT * HEIGHT)),
HYPERTENSION = HYPERTENSION | !is.na(HYPERTENTREATM) | MEDICALCOMMENT == "Said no to hypertension but beside BP states is on losartan" | DIURETICS | TLDIURETICS,
DIABETES = T2DIABETES | !is.na(T2DTREATMENT),
HEART = MI | IHD | HEARTFAILURE | MEDICALCOMMENT %in% c("Cardiovascular disease", "Heart problems", "Heart problems. EGFR available", "Heart problems. EGFR available. EGFR available. EGFR<60"),
CREAT = SERUMCREATININE / 88.42,
EGFR = case_when(SEX == "Male" ~ 175 * (CREAT ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (CREAT ^ -1.154) * (AGECOL ^ -0.203) * 0.742,
TRUE ~ EGFR),
KIDNEY = RENALDISEASE | EGFR < 60,
LIPIDS = DYSLIPIDEMIA | !is.na(LIPIDTREATMENT),
STROKE = STROKE,
TOTALALC = ALCOHOL,
SUGDRINK = SUGARDRINK + FRUITJUICE,
CURSMOKE = SMOKER == 1,
FAMGOUT = FAMILYHISTORY == 1,
FAMGOUTNUM = NUMFAMILYGOUT) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Ardea - Ironwood (CLEAR 1, CLEAR 2, CRYSTAL, and LIGHT).
# Making temporary phenotype file for the Ironwood cohorts based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study %in% c("Ardea: CLEAR1", "Ardea: CLEAR2", "Ardea: CRYSTAL", "Ardea: LIGHT"))
# Extracting IDs from the SNPmax Ironwood phenotype file based on the CoreExome file. Selecting columns of interest. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
ironwood_pheno <- read_delim(here("Data/Phenotypes/ArdeaPheno.txt"), delim = "\t", guess_max = 5000) %>%
filter(SUBJID %in% tmp$Pheno.SampleID) %>%
select(SUBJID, AGE, BRTHDTC, BLWEIGHT, BLHEIGHT, BLBMI, TRT01AN, CONSDT, TRTSDT, ANGINA:HYPERTRIGLY, MI, STROKE, AGFIDDT:GFDUR, CRITBFL, PHNM8FL, ULTALLO:ULTOTH, PLACTOTSTDT:PLACTOTENDT, THIALKFL:PROPHTYPN, TOPHIFN:BLAREA, GFNUM:GFNUMGR, DATESCREENING:EGFRSCREENING, CHOLSCREENING, TRIGSCREENING, URATESCREENING, DATENEG7, URATENEG7, EGFRNEG7, DATEBASELINE, URATEBASELINE, EGFRBASELINE, DATEMONTH1, URATEMONTH1, DATEMONTH2, URATEMONTH2, DATEMONTH3, URATEMONTH3, DATEMONTH4, URATEMONTH4, DATEMONTH5, URATEMONTH5, DATEMONTH6, URATEMONTH6, DATEMONTH8, URATEMONTH8, DATEMONTH10, URATEMONTH10, DATEMONTH12, URATEMONTH12, DATEEARLYTERM, URATEEARLYTERM, DATEFOLLOWUP, URATEFOLLOWUP, CURSMOKE:ALCOHOL, TOPHIGOUT:GOUTNOTES) %>%
left_join(tmp, by = c("SUBJID" = "Pheno.SampleID")) %>%
mutate(across(where(is_character), factor),
across(c(ANGINA:STROKE, CRITBFL:ULTOTH, THIALKFL:PROPHYFL, TOPHIFN, CURSMOKE:ALCOHOL),
logicfactor2),
TRT01AN = factor(TRT01AN,
levels = 0:5,
labels = c("Screen Failure", "Group A (Placebo)", "Group B (Lesinurad 200 mg)", "Group C (Lesinurad 400 mg)", "Not Assigned", "Not Treated"))) %>%
rename(IID = SUBJID,
GOUT = Pheno.GoutSummary,
AGECOL = AGE) %>%
mutate(SEX = Geno.GeneticSex,
AGE1ATK = round(as.duration(interval(ymd(BRTHDTC, truncated = 2L), AGFIDDT)) / as.duration(years(1)),
digits = 0),
DURATION = AGECOL - AGE1ATK + 1,
TOPHIGOUT = TOPHIFN,
EROSIONS = NA,
NUMATK = GFNUM,
URATE = URATESCREENING,
ULT = ULTALLO | ULTPROB | ULTFEBU | ULTOTH | Pheno.Study %in% c("Ardea: CLEAR1", "Ardea: CLEAR2") | (Pheno.Study == "Ardea: CRYSTAL" & URATE < 8),
PROPHY = PROPHYFL,
BMI = BLBMI,
HEART = HEARTFAILURE | MI | ANGINA,
KIDNEY = EGFRSCREENING < 60,
LIPIDS = HYPERCHOLESTEROL | HYPERTRIGLY,
TOTALALC = NA,
SUGDRINK = NA,
FAMGOUT = NA,
FAMGOUTNUM = NA,
EGFR2 = case_when(SEX == "Male" ~ 175 * (SCRSCREENING ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (SCRSCREENING ^ -1.154) * (AGECOL ^ -0.203) * 0.742)) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Ardea - LASSO.
# Making temporary phenotype file for the Ironwood cohorts based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study == "Ardea: 401")
# Reading in three LASSO cohort phenotype files.
lassopheno1 <- read_delim(here("Data/Phenotypes/ArdeaLassoPhenoFlare.txt"), delim = "\t", guess_max = 5000)
lassopheno2 <- read_delim(here("Data/Phenotypes/ArdeaLassoPhenoLabChem.txt"), delim = "\t", guess_max = 5000)
lassopheno3 <- read_delim(here("Data/Phenotypes/ArdeaLassoPhenoMain.txt"), delim = "\t", guess_max = 5000)
# Combining all three phenotype files together.
tmp1 <- full_join(lassopheno3, lassopheno2, by = "SUBJID")
lasso_pheno <- full_join(tmp1, lassopheno1, by = "SUBJID")
# Cleaning up.
rm(lassopheno1, lassopheno2, lassopheno3, tmp1)
# Extracting IDs from the SNPmax LASSO phenotype file based on the CoreExome file. Converting the DNAID into a character variable named IID. Selecting columns of interest. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
lasso_pheno <- lasso_pheno %>%
filter(DNAID %in% tmp$Pheno.SampleID) %>%
mutate(IID = as.character(DNAID)) %>%
select(IID, AGE, BRTHDTC, GFNUM:ULTOSCR, BLBMI:PROPHTYP, AGFIDDT:GFDUR, TOLOCL:GFDTDURL, ANGINA:RENALIMPAIR, MI:STROKE, SCREENGFSTDT, SCREENGFENDT, SCREENGFOUT, SCREENGFSEV, SCREENPAIN, SCREENGFDUR:SCREENPAIN2, SCREENGFSTRSTP, SCREENLBDT, SCREENALT, SCREENCREAT, SCREENGGT, SCREENURATE, BASELINELBDT, BASELINEURATE, BASET1LBDT, BASET1URATE, BASET2LBDT, BASET2URATE, BASET3LBDT, BASET3URATE, MONTH1LBDT, MONTH1URATE, MONTH1T1LBDT, MONTH1T1URATE, MONTH1T2LBDT, MONTH1T2URATE, MONTH2LBDT, MONTH2URATE, MONTH2T1LBDT, MONTH2T1URATE, MONTH3LBDT, MONTH3URATE, MONTH3T1LBDT, MONTH3T1URATE, MONTH3T3LBDT, MONTH3T3URATE, MONTH4LBDT, MONTH4URATE, MONTH4T1LBDT, MONTH4T1URATE, MONTH5LBDT, MONTH5URATE, MONTH6LBDT, MONTH6URATE, UNSCHEDLBDT, UNSCHEDURATE, EARLYTERMLBDT, EARLYTERMURATE) %>%
left_join(tmp, by = c("IID" = "Pheno.SampleID")) %>%
rename(GOUT = Pheno.GoutSummary,
AGECOL = AGE,
NUMATK = GFNUM,
SEX = Geno.GeneticSex,
BMI = BLBMI) %>%
mutate(across(where(is_character), factor),
across(c(TOHANDFL:ULTOSCR, BLCDFL, ANGINA:RENALIMPAIR, MI:STROKE),
logicfactor2),
AGE1ATK = round(as.duration(interval(ymd(BRTHDTC, truncated = 2L), AGFIDDT)) / as.duration(years(1)),
digits = 0),
DURATION = AGECOL - AGE1ATK + 1,
TOPHIGOUT = BLTPHFN,
EROSIONS = NA,
URATE = SCREENURATE,
ULT = ALLOSCR | ULTOSCR | SCREENURATE < 8,
PROPHY = PROPHTYP %in% c("Both", "Colchicine", "NSAID"),
HEART = ANGINA | MI,
EGFR = case_when(SEX == "Male" ~ 175 * (SCREENCREAT ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (SCREENCREAT ^ -1.154) * (AGECOL ^ -0.203) * 0.742),
KIDNEY = RENALIMPAIR | EGFR < 60,
LIPIDS = HYPERCHOLESTEROL | HYPERTRIGLY,
TOTALALC = NA,
SUGDRINK = NA,
CURSMOKE = NA,
FAMGOUT = NA,
FAMGOUTNUM = NA) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Combining phenotypes for all cohorts together. Additionally, deduplicating the IID column so as to keep only unique individuals.
all_pheno <- rbind(agria_pheno, aotearoa_pheno, dm_pheno, eurogout_pheno, ironwood_pheno, lasso_pheno, lpa_pheno, nph_pheno, rd_pheno) %>%
mutate(Pheno.Study = factor(Pheno.Study)) %>%
arrange(IID) %>%
filter(!(duplicated(IID) | duplicated(IID, fromLast = TRUE)))
# Cleaning up.
rm(aotearoa_pheno, agria_pheno, dm_pheno, nph_pheno, rd_pheno, eurogout_pheno, ironwood_pheno, lasso_pheno, tmp, lpa_pheno, CoreExPheno_Final)
# Extracting phenotype information for the UK Biobank cohort.
# Saving location of the phenotype data as a variable.
ukbb_dir <- path("/Volumes/archive/merrimanlab/raid_backup/UKbiobank/")
# Saving location of the coding files for phenotypes.
codings_dir <- path(here("Data/codings"))
# Reading in list of withdrawn IDs.
withdrawn <- read_csv(path(ukbb_dir, "w12611_20210201.new.csv"), col_names = "id")
# Loading UKBB phenotype file from 2019.
load(path(ukbb_dir,'decrypted_files/ukb27189_27190_27191_27192_27193_27194_27195_27640_30070_31460_combined_withdrawn_ids_removed_10-07-2019.RData'))
# Renaming working dataset and removing withdrawn and missing IDs (leaves 488,247 individuals).
bd_refined <- refresh_ukbb_data %>%
filter(!is.na(eid) & !(eid %in% withdrawn$id)) # Total 488,247
# Reading coding document for self-reported diseases.
datacoding6 <- read_delim(path(codings_dir, "coding6.tsv"), delim = "\t")
# Reading coding document for self-reported medication.
datacoding4 <- read_delim(path(codings_dir, "coding4.tsv"), delim = "\t")
# Reading Nicola Dalbeth annotated datacoding4 file.
nd_drugs <- read_csv(path(ukbb_dir, "UKBio_drugs_ND_formatted.csv"),
col_names = TRUE,
col_types = cols(.default = "c"))
# Reading coding document for ICD10 (hospital diagnoses).
datacoding19 <- read_delim(path(codings_dir, "coding19.tsv"), delim = "\t")
# Cleaning up.
rm(withdrawn, codings_dir)
# Making variables for the coding of each self-report phenotype of interest.
gout_code <- datacoding6 %>%
filter(str_detect(meaning, "gout")) %>%
pull(coding) %>%
as.character()
hypertension_code1 <- "1072"
hypertension_code2 <- "1065"
diabetes_code1 <- "1220"
diabetes_code2 <- "1222"
diabetes_code3 <- "1223"
cholesterol_code <- "1473"
mi_code <- "1075"
angina_code <- "1074"
heartfail_code <- "1076"
renal_code <- "1192"
liver_code <- "1158"
stroke_code1 <- "1081"
stroke_code2 <- "1583"
# Extracting true/false columns for each of the above codes. First selecting only the 102 columns that contain self-report non-cancer illness data (+ the eid column). Next concatenating all columns into a single column with each value separated by an "_". Removing all "NA_" or "_NA" strings from all observations.
self_report_illness_data <- bd_refined %>%
select(eid, contains('f20002_')) %>%
unite("self_report", contains("f20002_"), sep = "_") %>%
mutate(self_report_illness = str_remove_all(self_report, "NA_|_NA"),
gout_self_report = str_detect(self_report_illness, gout_code),
hypertension_self_report1 = str_detect(self_report_illness, hypertension_code1),
hypertension_self_report2 = str_detect(self_report_illness, hypertension_code2),
diab_self_report = str_detect(self_report_illness, diabetes_code1),
t1d_self_report = str_detect(self_report_illness, diabetes_code2),
t2d_self_report = str_detect(self_report_illness, diabetes_code3),
chol_self_report = str_detect(self_report_illness, cholesterol_code),
mi_self_report = str_detect(self_report_illness, mi_code),
angina_self_report = str_detect(self_report_illness, angina_code),
heartfailure_self_report = str_detect(self_report_illness, heartfail_code),
renalfailure_self_report = str_detect(self_report_illness, renal_code),
liver_self_report = str_detect(self_report_illness, liver_code),
stroke_self_report1 = str_detect(self_report_illness, stroke_code1),
stroke_self_report2 = str_detect(self_report_illness, stroke_code2)) %>%
select(eid, gout_self_report, hypertension_self_report1, hypertension_self_report2, diab_self_report, t1d_self_report, t2d_self_report, chol_self_report, mi_self_report, angina_self_report, heartfailure_self_report, renalfailure_self_report, liver_self_report, stroke_self_report1, stroke_self_report2)
# Cleaning up.
rm(datacoding6, gout_code, hypertension_code1, hypertension_code2, diabetes_code1, diabetes_code2, diabetes_code3, cholesterol_code, mi_code, angina_code, heartfail_code, renal_code, liver_code, stroke_code2, stroke_code1)
# Defining codes for various drugs.
colchicine_codes <- nd_drugs %>%
filter(Colchicine == "y") %>%
pull(coding)
allopurinol_codes <- nd_drugs %>%
filter(Allopurinol == "y") %>%
pull(coding)
sulphinpyrazone_codes <- nd_drugs %>%
filter(Sulphinpyrazone == 'y') %>%
pull(coding)
probenecid_codes <- nd_drugs %>%
filter(Probenecid == "y") %>%
pull(coding)
diuretics_codes <- nd_drugs %>%
filter(Diuretic == 'y' ) %>%
pull(coding)
betablocker_codes <- c("1140916342",
"1140866738",
"1140879818",
"1140879854",
"1140909368",
"1140879760")
antihyper_code <- "1140888578"
ace_inhib_codes <- c("1140860750",
"1140888552",
"1140860696",
"1140860806")
atiir_antag_codes <- c("1141156836",
"1141145660",
"1140916356")
calc_block_codes <- c("1140879806",
"1140888510",
"1140879802")
insulin_code <- "1140883066"
metformin_code <- "1140884600"
statin_codes <- c("1141146234",
"1140888594",
"1140888648",
"1141192410",
"1140861958")
ezetimibe_code <- "1141192736"
fenofibrate_code <- "1140861954"
cortico_prednisone_codes <- nd_drugs %>%
filter(`Corticosteroids/prednisone` == 'y' ) %>%
pull(coding)
nsaids_codes <- nd_drugs %>%
filter(NSAIDs == 'y') %>%
pull(coding)
control_drug_exclusion_codes <- nd_drugs %>%
filter(Allopurinol == 'y' | Colchicine == 'y' | Febuxostat == 'y' | Sulphinpyrazone == 'y' | Probenecid == 'y' | `Corticosteroids/prednisone` == 'y' | NSAIDs == 'y') %>%
pull(coding)
# Making TRUE/FALSE columns for each drug.
self_report_med <- bd_refined %>%
select(eid, contains("f20003_")) %>%
unite("medications", contains("f20003_"), sep = "_") %>%
mutate(colchicine = str_detect(medications, colchicine_codes),
allopurinol = str_detect(medications, paste(allopurinol_codes, collapse = '|')),
sulphinpyrazone = str_detect(medications, paste(sulphinpyrazone_codes, collapse = '|')),
probenecid = str_detect(medications, paste(probenecid_codes, collapse = '|')),
diuretics = str_detect(medications, paste(diuretics_codes, collapse = '|')),
beta_blockers = str_detect(medications, paste(betablocker_codes, collapse = '|')),
antihyper = str_detect(medications, antihyper_code),
ace_inhibitors = str_detect(medications, paste(ace_inhib_codes, collapse = '|')),
atiir_antagonists = str_detect(medications, paste(atiir_antag_codes, collapse = '|')),
calcium_blockers = str_detect(medications, paste(calc_block_codes, collapse = '|')),
insulin = str_detect(medications, insulin_code),
metformin = str_detect(medications, metformin_code),
statin = str_detect(medications, paste(statin_codes, collapse = '|')),
ezetimibe = str_detect(medications, ezetimibe_code),
fenofibrate = str_detect(medications, fenofibrate_code),
afs = ifelse(allopurinol | sulphinpyrazone, TRUE, FALSE),
nsaids = str_detect(medications, paste(nsaids_codes, collapse = '|')),
cortico_prednisone = str_detect(medications, paste(cortico_prednisone_codes, collapse = '|')),
control_exclude_drug_criteria = str_detect(medications, paste(control_drug_exclusion_codes, collapse = '|')),
ULT = ifelse(allopurinol | sulphinpyrazone | probenecid, TRUE, FALSE))
# Cleaning up.
rm(nd_drugs, datacoding4, colchicine_codes, allopurinol_codes, sulphinpyrazone_codes, probenecid_codes, diuretics_codes, nsaids_codes, cortico_prednisone_codes, insulin_code, antihyper_code, statin_codes, control_drug_exclusion_codes, betablocker_codes, metformin_code, ezetimibe_code, fenofibrate_code, ace_inhib_codes, atiir_antag_codes, calc_block_codes)
# Recording gout codes.
icd_gout_codes <- datacoding19 %>%
filter(str_detect(coding, 'M10')) %>%
pull(coding)
# Recording all blood cancer codes.
hospital_blood_cancer_codes <- paste0("C", 81:96)
# Recording all renal disease codes.
rd_icd_10_codes <- c("I12", "I13", paste0("N0", 0:5), "N07", "N11", "N14", paste0("N", 17:19), "Q61", "N250", "Z49", "Z940", "Z992")
# Making table of ICD-10 data.
icd10_data <- bd_refined %>%
select(eid, contains("f41202"), contains("f41204")) %>%
unite("hospital_diagnoses", contains("diagnoses")) %>%
mutate(filtered_codes = str_remove_all(hospital_diagnoses, "NA_|_NA|NA")) %>%
select(eid, hospital_diagnoses, filtered_codes) %>%
mutate(gout_hospital = str_detect(filtered_codes, paste0(icd_gout_codes, collapse = "|")),
hosp_hypertension = str_detect(filtered_codes, paste0(paste0("I",10:15), collapse = "|")),
hosp_t1d = str_detect(filtered_codes, "E10"),
hosp_t2d = str_detect(filtered_codes, "E11"),
hosp_other_diab = str_detect(filtered_codes, "E13"),
hosp_hyperlipidemia = str_detect(filtered_codes, "E78"),
hosp_ischemic_heart = str_detect(filtered_codes, paste0(paste0("I",20:25), collapse = "|")),
hosp_heartfail = str_detect(filtered_codes, "I50"),
hosp_cerebrovascular = str_detect(filtered_codes, paste0(paste0("I",60:69), collapse = "|")),
hosp_general_renal = str_detect(filtered_codes, paste0(rd_icd_10_codes, collapse = "|")),
hosp_ckd_stage3 = str_detect(filtered_codes, "N183"),
hosp_ckd_stage4 = str_detect(filtered_codes, "N184"),
hosp_ckd_stage5 = str_detect(filtered_codes, "N185"),
hosp_ckd_endstage = str_detect(filtered_codes, "N186"),
hosp_liver_disease = str_detect(filtered_codes, paste0(paste0("K",70:77), collapse = "|")),
hosp_other_liver = str_detect(filtered_codes, "K75"),
hospital_ult_exclude = str_detect(filtered_codes, paste0(hospital_blood_cancer_codes, collapse = "|"))
)
# Cleaning up.
rm(datacoding19, hospital_blood_cancer_codes, icd_gout_codes, rd_icd_10_codes)
# Extracting other columns of interest, converting to consistent units and defining high levels of biomarkers.
diagnostics <- bd_refined %>% select(eid, body_mass_ind_bmi_f21001_0_0, glucose_f30740_0_0, triglycerides_f30870_0_0, cholesterol_f30690_0_0, creatinine_f30700_0_0, age_when_attended_assessment_centre_f21003_0_0, s_f31_0_0, alanine_aminotransferase_f30620_0_0, gamma_glutamyltransferase_f30730_0_0, urate_f30880_0_0, alcohol_intake_frequency_f1558_0_0, current_tobacco_smoking_f1239_0_0, contains("ethnic")) %>%
mutate(sex = (as.numeric(s_f31_0_0) - 2) * -1, #converts males to 0, females to 1
age = age_when_attended_assessment_centre_f21003_0_0,
bmi = body_mass_ind_bmi_f21001_0_0,
creat = creatinine_f30700_0_0 / 88.42, #converts to mg/dL
eGFR_f = 175*(creat^-1.154)*(age^-0.203)*0.742,
eGFR_m = 175*(creat^-1.154)*(age^-0.203),
CKD3_f = eGFR_f < 60 & eGFR_f >= 30,
CKD3_m = eGFR_m < 60 & eGFR_m >= 30,
CKD4_f = eGFR_f < 30 & eGFR_f >= 15,
CKD4_m = eGFR_m < 30 & eGFR_m >= 15,
CKD5_f = eGFR_f < 15,
CKD5_m = eGFR_m < 15,
obese = bmi >= 30,
chol = cholesterol_f30690_0_0 * 0.385, #converts to g/L
trig = triglycerides_f30870_0_0 * 0.88, #converts to g/L
gluc = glucose_f30740_0_0 * 0.18, #converts to g/L
hyperchol = chol > 2,
hypertrig = trig > 2.5,
hypergluc = gluc > 2,
alamino = alanine_aminotransferase_f30620_0_0,
gamglut = gamma_glutamyltransferase_f30730_0_0,
hyperalamino_m = alamino > 30,
hyperalamino_f = alamino > 19,
hypergamglut_m = gamglut > 51,
hypergamglut_f = gamglut > 33,
hyperalt = alamino > 110,
hyperggt = gamglut > 100,
urate = urate_f30880_0_0)
# Carefully defining blood pressure measurements based on first instance.
bp <- bd_refined %>%
select(eid, contains("f4079_0"), contains("f4080_0"), contains("f94_0"),contains("f93_0")) %>%
gather("measure", "value", -eid) %>% filter(!is.na(value)) %>%
mutate(measure_num = str_extract(measure, "([0-9])+$")) %>%
mutate(visit = str_remove_all(measure, "(_0$)|(_1$)")) %>%
mutate(bp_type = case_when(str_detect(visit, "systolic") ~ "systolic", str_detect(visit, "diastolic") ~ "diastolic" ),
bp_method = case_when(str_detect(visit, "automated") ~ "auto", str_detect(visit, "manual") ~ "manual")) %>%
mutate(visit_num = str_extract(visit, "_([0-9])+$")) %>%
mutate(visit_num = str_remove_all(visit_num, "_"))
# Calculating hypertension columns.
ht_calc <- bp %>%
select(-visit, -measure) %>%
spread("bp_type", "value") %>%
group_by(eid) %>%
summarise(diastolic_ht = sum(diastolic > 90) >= 2, diastolic_ht_n = sum(diastolic >90),
systolic_ht = sum(systolic > 140) >= 2, systolic_ht_n = sum(systolic > 140),
hypertension_measured = ifelse(diastolic_ht | systolic_ht, TRUE, FALSE))
# Again, carefully defining blood pressure measurements based on second instance.
bp2 <- bd_refined %>%
select(eid, contains("f4079_1"), contains("f4080_1"), contains("f94_1"),contains("f93_1")) %>%
gather("measure", "value", -eid) %>% filter(!is.na(value)) %>%
mutate(measure_num = str_extract(measure, "([0-9])+$")) %>%
mutate(visit = str_remove_all(measure, "(_0$)|(_1$)")) %>%
mutate(bp_type = case_when(str_detect(visit, "systolic") ~ "systolic", str_detect(visit, "diastolic") ~ "diastolic" ),
bp_method = case_when(str_detect(visit, "automated") ~ "auto", str_detect(visit, "manual") ~ "manual")) %>%
mutate(visit_num = str_extract(visit, "_([0-9])+$")) %>%
mutate(visit_num = str_remove_all(visit_num, "_"))
# Calculating hypertension columns.
ht_calc2 <- bp2 %>%
select(-visit, -measure) %>%
spread("bp_type", "value") %>%
group_by(eid) %>%
summarise(diastolic_ht = sum(diastolic > 90) >= 2, diastolic_ht_n = sum(diastolic >90),
systolic_ht = sum(systolic > 140) >= 2, systolic_ht_n = sum(systolic > 140),
hypertension_measured2 = ifelse(diastolic_ht | systolic_ht, TRUE, FALSE))
# Combining columns together.
ht_calc3 <- ht_calc2 %>%
left_join(ht_calc, by = "eid") %>%
mutate(hypertension_measured3 = hypertension_measured2 + hypertension_measured,
hypertension_measured_both = ifelse(hypertension_measured3 == 2, TRUE, FALSE))
# Adding hypertension columns.
diagnostics <- diagnostics %>%
left_join((ht_calc %>% select(1, 6)), by = "eid") %>%
left_join((ht_calc3 %>% select(1, 13)), by = "eid")
# Calculating eGFR columns in men.
diagnostics_m <- diagnostics %>%
filter(sex == 0) %>%
mutate(eGFR_f = rep(NA, (nrow(diagnostics) - sum(diagnostics$sex))),
eGFR = eGFR_m,
CKD3_f = rep(NA, (nrow(diagnostics) - sum(diagnostics$sex))),
CKD3 = CKD3_m,
CKD4_f = rep(NA, (nrow(diagnostics) - sum(diagnostics$sex))),
CKD4 = CKD4_m,
CKD5_f = rep(NA, (nrow(diagnostics) - sum(diagnostics$sex))),
CKD5 = CKD5_m,
hyperalamino_f = rep(NA, (nrow(diagnostics) - sum(diagnostics$sex))),
hyperalamino = hyperalamino_m,
hypergamglut_f = rep(NA, (nrow(diagnostics) - sum(diagnostics$sex))),
hypergamglut = hypergamglut_m)
# Calculating eGFR columns in women.
diagnostics_f <- diagnostics %>%
filter(sex == 1) %>%
mutate(eGFR_m = rep(NA, sum(diagnostics$sex)),
eGFR = eGFR_f,
CKD3_m = rep(NA, sum(diagnostics$sex)),
CKD3 = CKD3_f,
CKD4_m = rep(NA, sum(diagnostics$sex)),
CKD4 = CKD4_f,
CKD5_m = rep(NA, sum(diagnostics$sex)),
CKD5 = CKD5_f,
hyperalamino_m = rep(NA, sum(diagnostics$sex)),
hyperalamino = hyperalamino_f,
hypergamglut_m = rep(NA, sum(diagnostics$sex)),
hypergamglut = hypergamglut_f)
# Combining together
diagnostics <- rbind(diagnostics_m, diagnostics_f)
# Making final diagnostics dataset.
diagnostics <- diagnostics %>%
arrange(eid) %>%
select(eid, sex, age, bmi, obese, creat, eGFR, CKD3, CKD4, CKD5, chol, trig, gluc, hyperchol, hypertrig, hypergluc, alamino, gamglut, hyperalamino, hypergamglut, hyperalt, hyperggt, hypertension_measured, hypertension_measured_both, urate, alcohol_intake_frequency_f1558_0_0, current_tobacco_smoking_f1239_0_0, contains("ethnic"))
# Cleaning up.
rm(diagnostics_f, diagnostics_m, bp, bp2, ht_calc, ht_calc2, ht_calc3)
# Combining phenotypes files together.
combined_data <- self_report_med %>%
left_join(icd10_data, by = "eid") %>%
left_join(self_report_illness_data, by = 'eid') %>%
left_join(diagnostics, by = 'eid') %>%
mutate(gout_ult = ifelse(hospital_ult_exclude, FALSE, afs),
gout_winnard = gout_hospital | gout_ult | colchicine,
gout_drug = gout_ult | colchicine | (probenecid & !hospital_ult_exclude),
goutall = ifelse(gout_self_report | gout_hospital | gout_drug, TRUE, FALSE))
# Cleaning up.
rm(diagnostics, icd10_data, self_report_illness_data, self_report_med)
# Making a dataframe with QC metrics for each individual.
sample_qc_ukbtool <- ukb_gen_sqc_names(read_delim(path(ukbb_dir,'genetic_files/ukb_sqc_v2.txt'), delim = " ", col_names = FALSE))
# Reading fam file into R.
fam <- read_table(path(ukbb_dir, 'genetic_files/ukb12611_cal_chr22_v2_s488285.fam'), col_names= c("FID", "IID", "PID", "MID", "SEX", "AFF")) %>%
rownames_to_column()
# Adding fam file columns onto qc dataframe.
sample_qc_fam <- cbind(sample_qc_ukbtool, fam)
# Making the QC exclusion column, based on sex chromosome aneuploidy, gender mismatch or heterozygosity outliers (1,812 total).
sample_qc_fam <- sample_qc_fam %>%
mutate(eid = IID,
qc_exclude = ifelse(putative_sex_chromosome_aneuploidy == 1 | submitted_gender != inferred_gender | het_missing_outliers == 1, TRUE, FALSE))
# Cleaning up.
rm(fam, sample_qc_ukbtool)
# Defining controls (to aid in gout definitions) as anyone without any gout definition (ULT, colchicine, hospital, self-report) and they can't be on corticosteroids or NSAIDs.
combined_data <- combined_data %>%
mutate(control = !goutall & !control_exclude_drug_criteria)
# Assigning gout affection status and remove qc excluded individuals.
gout_affection <- sample_qc_fam %>%
mutate(eid = IID) %>%
left_join(combined_data, by = "eid") %>%
filter(!qc_exclude) %>%
mutate(goutaff1 = case_when(gout_self_report | gout_ult ~ TRUE,
control ~ FALSE),
goutaff2 = case_when(gout_self_report | gout_ult | gout_hospital ~ TRUE,
control ~ FALSE)) %>%
filter(!is.na(goutaff1))
# Cleaning up.
rm(sample_qc_fam)
# Making vector for filtering for self-reported ethnicity groups.
inc_eth_groups <- c("White", "British", "Irish", "Any other white background")
# Extracting individuals in the above ethnicity groups.
gout_pca <- gout_affection %>%
select(eid, contains("ethnic"), starts_with("pc"), goutaff1, in_white_british_ancestry_subset, used_in_pca_calculation, in_kinship_table, excess_relatives) %>%
filter(ethnic_background_f21000_0_0 %in% inc_eth_groups)
# Defining boundaries of caucasian PCs.
brit_boundaries <- gout_pca %>%
filter(genetic_ethnic_grouping_f22006_0_0 == "Caucasian") %>%
select(starts_with("pc")) %>%
gather("PC", "value") %>%
group_by(PC) %>%
summarise_at(vars(value), list(min = min, max = max))
# Filtering all individuals who fit within the PCA limits of the Caucasian group (including non-self-report).
eth_goutaff <- gout_pca %>%
gather("PC", "PC_value", starts_with("pc")) %>%
left_join(brit_boundaries, by = "PC") %>%
filter(PC_value > min & PC_value < max) %>%
select(eid, ethnic_background_f21000_0_0, goutaff1) %>%
group_by(eid, ethnic_background_f21000_0_0, goutaff1) %>%
tally() %>%
filter(n == 40) %>%
ungroup()
# Filtering gout cases for PC exclusions.
gout_affection <- gout_affection %>%
filter(eid %in% unique(eth_goutaff$eid))
# Cleaning up.
rm(eth_goutaff, brit_boundaries, gout_pca, inc_eth_groups)
# Loading relatedness data.
related <- read_delim(path(ukbb_dir, "genetic_files/ukb12611_rel_s488363.dat"), col_names = TRUE, delim = " ")
# Extracting affected related individuals.
affected_related <- related %>%
filter(ID1 %in% gout_affection$eid,
ID2 %in% gout_affection$eid)
# Extracting gout case ids.
gout_ids <- gout_affection %>%
filter(goutaff1) %>%
pull(eid)
# Extracting control ids.
control_ids <- gout_affection %>%
filter(!goutaff1) %>%
pull(eid)
# Identifying first degree relatives and their gout status.
first_degree <- affected_related %>%
filter(Kinship > 0.177) %>%
mutate(related_category = case_when(ID1 %in% gout_ids & ID2 %in% gout_ids ~ "gout_gout",
(ID1 %in% gout_ids & ID2 %in% control_ids) | (ID1 %in% control_ids & ID2 %in% gout_ids) ~ "gout_control",
ID1 %in% control_ids & ID2 %in% control_ids ~ "control_control"))
# Making function to create a table showing the number of people an individual is related to and affection status.
create_related_table <- function(related_df){
# Creating temporary tibbles for each category.
gg <- tibble(eid = c(related_df %>% filter(related_category == "gout_gout") %>% pull(ID1),
related_df %>% filter(related_category == "gout_gout") %>% pull(ID2)),
category = "gout_gout")
gc <- tibble(eid = c(related_df %>% filter(related_category == "gout_control") %>% pull(ID1),
related_df %>% filter(related_category == "gout_control") %>% pull(ID2)),
category = "gout_control")
cc <- tibble(eid = c(related_df %>% filter(related_category == "control_control") %>% pull(ID1),
related_df %>% filter(related_category == "control_control") %>% pull(ID2)),
category = "control_control")
# Combining together and making table.
indiv_categories <- bind_rows(gg, gc, cc) %>%
group_by(eid, category) %>%
tally() %>%
spread(category, n, fill = 0) %>%
mutate(control_control = ifelse("control_control" %in% names(.), control_control, 0),
gout_control = ifelse("gout_control" %in% names(.), gout_control, 0),
gout_gout = ifelse("gout_gout" %in% names(.), gout_gout, 0),
total = control_control + gout_control + gout_gout) %>%
left_join(., gout_affection %>% select(eid, goutaff1), by = "eid") %>%
ungroup()
return(indiv_categories)
}
# Removing one of the pair for people with only one pair relationship.
start <- ukb_gen_rel_count(affected_related)
# Defining number of relations by person.
temp_table <- create_related_table(related_df = first_degree)
# Finding people with more than one relationship and the part of pair to be dealt with later.
multi_relations <- first_degree %>%
filter(ID1 %in% (temp_table %>% filter(total > 1) %>% pull(eid)) | ID2 %in% (temp_table %>% filter(total > 1) %>% pull(eid)))
# Creating a list of all people involved with multirelation relationships.
multi_relations_ids <- c(multi_relations$ID1, multi_relations$ID2)
# Listing all controls with only a single relationship that aren't also part of a multi-relationship pair.
single_controls <- temp_table %>%
filter(control_control == 1,
total == 1,
!(eid %in% multi_relations_ids)) %>%
pull(eid)
# Extracting 1st IDs of these pairs.
trc1 <- first_degree %>%
filter(ID1 %in% single_controls,
ID2 %in% single_controls) %>%
pull(ID1)
# Extracting 2nd IDs of these pairs.
trc2 <- first_degree %>%
filter(ID1 %in% single_controls,
ID2 %in% single_controls) %>%
pull(ID2)
# Removing the IDs that were in column 2.
mid <- ukb_gen_rel_count((first_degree %>% filter(!(ID1 %in% trc2 | ID2 %in% trc2))))
# Making table of single gout_control relationship pairs where we remove the control.
single_gout_controls <- temp_table %>%
filter(gout_control == 1,
total == 1,
!(eid %in% multi_relations_ids)) %>%
filter(!goutaff1) %>%
pull(eid)
# Making list of samples to remove.
remove_samples <- c(trc2, single_gout_controls)
# Making midpoint file.
mid2 <- ukb_gen_rel_count((first_degree %>% filter(!(ID1 %in% remove_samples | ID2 %in% remove_samples))))
# Finding total number of rows of first_degree table.
curr_rows <- nrow(first_degree)
# Setting i to 1.
i <- 1
# Making empty list for output of below while statement.
order_removed <- list()
# Making loop that runs while curr_rows is greater than or equal to 0.
while(curr_rows >= 0){
# Making temporary related table.
temp_related <- first_degree %>%
filter(!(ID1 %in% remove_samples | ID2 %in% remove_samples))
# Setting curr_rows to the number of rows in temp_related.
curr_rows <- nrow(temp_related)
# If there are no more rows, then ending the loop.
if(curr_rows == 0){
break
}
# Making temporary table with relatedness metrics.
temp_table <- create_related_table(related_df = temp_related)
# Removing all related gout to gout relationships, then removing controls from gout to control, then removing related controls.
if(sum(temp_table$gout_gout) > 0){
remove_id <- temp_table %>%
arrange(desc(gout_gout), desc(total)) %>%
slice(1) %>%
pull(eid)
message("gg")
} else if(sum(temp_table$gout_control) > 0){
remove_id <- temp_table %>%
arrange(desc(gout_control), desc(total)) %>%
filter(!goutaff1) %>%
slice(1) %>%
pull(eid)
message("gc")
} else if(sum(temp_table$control_control) > 0){
remove_id <- temp_table %>%
arrange(desc(control_control), desc(total)) %>%
filter(!goutaff1) %>%
slice(1) %>%
pull(eid)
message("cc")
}
# Assigning the removed id to the order_removed list.
order_removed[[i]] <- remove_id
# Adding the removed id to the remove_samples vector.
remove_samples <- c(remove_samples, remove_id)
# Outputting the current state of this loop.
message(nrow(temp_related))
# Adding one to the iteration number.
i <- i + 1
}
# Reporting statistics after removing related ids.
end <- ukb_gen_rel_count((affected_related %>% filter(!(ID1 %in% remove_samples | ID2 %in% remove_samples))))
# Saving out as a new dataframe.
gout_affection2 <- gout_affection %>% filter(!eid %in% remove_samples)
# Cleaning up.
rm(affected_related, start, end, first_degree, gout_affection, mid, mid2, multi_relations, order_removed, related, temp_related, temp_table, control_ids, curr_rows, gout_ids, i, multi_relations_ids, remove_id, remove_samples, single_controls, single_gout_controls, trc1, trc2, create_related_table)
# Filtering combined data based on gout_affection2.
combined_data <- combined_data %>%
filter(eid %in% (gout_affection2 %>% filter(!is.na(goutaff1)) %>% pull(eid)))
# Defining final phenotypes.
final_data <- combined_data %>%
mutate(gout = gout_ult | gout_self_report,
hypertension = hypertension_self_report1 | hypertension_self_report2 | hosp_hypertension | antihyper | diuretics | beta_blockers | ace_inhibitors | atiir_antagonists | calcium_blockers,
dyslipidemia = chol_self_report | statin | ezetimibe | fenofibrate | hosp_hyperlipidemia,
type2_diabetes = ((diab_self_report | t2d_self_report | hosp_t2d | insulin | metformin | hypergluc) & !t1d_self_report & !hosp_t1d),
ckd_stage3 = hosp_ckd_stage3 | CKD3,
ckd_stage4 = hosp_ckd_stage4 | CKD4,
end_stage_renal = renalfailure_self_report | hosp_ckd_endstage | hosp_ckd_stage5 | CKD5,
obese = obese,
coronary_heart_disease = mi_self_report | angina_self_report | hosp_ischemic_heart,
heart_failure = heartfailure_self_report | hosp_heartfail,
liver_disease = hosp_liver_disease | liver_self_report | hyperggt | hyperalt,
cerebrovascular_disease = hosp_cerebrovascular | stroke_self_report1,
sex = factor(sex, levels = c(0, 1), labels = c("Male", "Female")),
urate = urate / 59.48) %>%
select(eid, gout, hypertension, dyslipidemia, type2_diabetes, ckd_stage3, ckd_stage4, end_stage_renal, obese, coronary_heart_disease, heart_failure, liver_disease, cerebrovascular_disease, age, sex, urate, alcohol_intake_frequency_f1558_0_0, current_tobacco_smoking_f1239_0_0, bmi, ULT)
# Combining all UK Biobank phenotypes of interest into a single file.
ukbb_pheno <- final_data %>%
mutate(eid = as.numeric(eid)) %>%
rename(IID = eid,
GOUT = gout,
AGECOL = age,
SEX = sex,
URATE = urate) %>%
mutate(Geno.PCVector1 = NA,
Geno.PCVector2 = NA,
Geno.PCVector3 = NA,
Geno.PCVector4 = NA,
Geno.PCVector5 = NA,
Geno.PCVector6 = NA,
Geno.PCVector7 = NA,
Geno.PCVector8 = NA,
Geno.PCVector9 = NA,
Geno.PCVector10 = NA,
Geno.PCVector1_Oc = NA,
Geno.PCVector2_Oc = NA,
Geno.PCVector3_Oc = NA,
Geno.PCVector4_Oc = NA,
Geno.PCVector5_Oc = NA,
Geno.PCVector6_Oc = NA,
Geno.PCVector7_Oc = NA,
Geno.PCVector8_Oc = NA,
Geno.PCVector9_Oc = NA,
Geno.PCVector10_Oc = NA,
AGE1ATK = NA,
DURATION = NA,
TOPHIGOUT = NA,
EROSIONS = NA,
NUMATK = NA,
PROPHY = NA,
Geno.SpecificAncestry = "European",
BMI = bmi,
HYPERTENSION = hypertension,
DIABETES = type2_diabetes,
HEART = coronary_heart_disease | heart_failure,
KIDNEY = ckd_stage3 | ckd_stage4 | end_stage_renal,
LIPIDS = dyslipidemia,
STROKE = cerebrovascular_disease,
TOTALALC = case_when(alcohol_intake_frequency_f1558_0_0 == "Daily or almost daily" ~ 14,
alcohol_intake_frequency_f1558_0_0 == "Three or four times a week" ~ 4,
alcohol_intake_frequency_f1558_0_0 == "Once or twice a week" ~ 2,
TRUE ~ NA_real_),
SUGDRINK = NA,
CURSMOKE = current_tobacco_smoking_f1239_0_0 == "Yes, on most or all days",
FAMGOUT = NA,
FAMGOUTNUM = NA,
Geno.SampleID = NA,
Pheno.Study = "UK Biobank") %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Cleaing up.
rm(refresh_ukbb_data, bd_refined, gout_affection2, ukbb_dir, logicfactor, logicfactor2, combined_data, final_data)
Next, related individuals (including duplicate samples) were excluded from the various cohorts in the CoreExome data (based on the 9,157 individuals in all_pheno). Individuals with KING kinship coefficient of > 0.177 (i.e. first degree related or more) were removed from the analysis, preferentially keeping gout cases over controls where possible.
# Given that we don't care about the FID column for this, we will recreate the filtered genotyped CoreExome files so that the FID and IID are matching, and all are unique.
# Testing whether all Geno.SampleID values are unique in the all_pheno file.
length(unique(all_pheno$Geno.SampleID)) == length(all_pheno$Geno.SampleID)
# Extracting these IDs as a variable.
unique_ids <- all_pheno %>%
select(Geno.SampleID)
# Reading in list of IDs from CoreExome fam file.
fam <- read_delim("/Volumes/archive/merrimanlab/raid_backup/New_Zealand_Chip_data/CoreExome/QC_MergedBatches/Final_Data/CZ-MB1.2-QC1.10_CoreExome24-1.0-3_genotyped-QCd_rsIDconverted.fam", delim = " ", col_names = F)
# Filtering fam file to keep only those in the unique_ids file.
fam_filtered <- fam %>%
filter(X2 %in% unique_ids$Geno.SampleID) %>%
select(X1, X2)
# Writing out file for filtering the fam file.
write_delim(fam_filtered, delim = "\t", file = path(scratch_path, "/Output/Temp/unique_ids.txt"), col_names = F)
# Cleaning up.
rm(fam_filtered, unique_ids, fam)
# Setting directory as a variable.
PRS_SCRATCH=/Volumes/scratch/merrimanlab/Nick/PRS
# Filtering CoreExome PLINK files to only include individuals for
plink1.9b6.10 --bfile /Volumes/archive/merrimanlab/raid_backup/New_Zealand_Chip_data/CoreExome/QC_MergedBatches/Final_Data/CZ-MB1.2-QC1.10_CoreExome24-1.0-3_genotyped-QCd_rsIDconverted --keep $PRS_SCRATCH/Output/Temp/unique_ids.txt --chr 1-22 --make-bed --out $PRS_SCRATCH/Output/Temp/for_relatedness
# Extracting IDs and cleaned up SEX column for all individuals.
all_pheno2 <- all_pheno %>%
select(Geno.SampleID, SEX) %>%
mutate(SEX = as.numeric(factor(SEX, levels = c("Male", "Female"))))
# Reading in filtered fam file.
fam2 <- read_delim(path(scratch_path, "/Output/Temp/for_relatedness.fam"), delim = " ", col_names = F)
# Overwriting FID column with IID column and adding SEX column to fam file.
fam_clean <- fam2 %>%
left_join(all_pheno2, by = c("X1" = "Geno.SampleID")) %>%
mutate(X1 = X2,
X3 = 0,
X4 = 0,
X5 = SEX) %>%
select(-SEX)
# Writing out new fam file over existing fam file.
write_delim(fam_clean, delim = " ", file = path(scratch_path, "/Output/Temp/for_relatedness.fam"), col_names = F)
# Setting directory as a variable.
PRS_SCRATCH=/Volumes/scratch/merrimanlab/Nick/PRS
# Downloading king binary file.
wget -O $PRS_SCRATCH/Tools/Linux-king.tar.gz https://www.kingrelatedness.com/executables/Linux-king227.tar.gz
# Extracting binary files from tar.gz file.
tar xzf $PRS_SCRATCH/Tools/Linux-king.tar.gz --directory $PRS_SCRATCH/Tools
# Running king function on for_relatedness PLINK files.
$PRS_SCRATCH/Tools/king -b $PRS_SCRATCH/Output/Temp/for_relatedness.bed --kinship --prefix $PRS_SCRATCH/Output/Temp/king
# Loading in computed relationships file.
relationships <- read_delim(path(scratch_path, "/Output/Temp/king.kin0"), delim = "\t")
# Extracting all relationships that are at first degree or higher (total of 517).
related <- relationships %>%
filter(Kinship > 0.177)
# Extracting all unique IDs that are in this related list (854 total).
related_ids_unique <- c(related$ID1, related$ID2) %>%
unique()
# Extracting gout status for these individuals.
all_pheno_related <- all_pheno %>%
filter(Geno.SampleID %in% related_ids_unique) %>%
select(Geno.SampleID, GOUT)
# Extracting ID columns of related individuals.
related_pairs <- related %>%
select(ID1, ID2)
# Merging together with gout status then labelling pair based on combination of gout and control status.
related_pairs2 <- related_pairs %>%
left_join(all_pheno_related, by = c("ID1" = "Geno.SampleID")) %>%
left_join(all_pheno_related, by = c("ID2" = "Geno.SampleID")) %>%
mutate(Pair = case_when(GOUT.x == "Gout" & GOUT.y == "Gout" ~ "GG",
GOUT.x == "Gout" & GOUT.y == "Control" ~ "GC",
GOUT.x == "Control" & GOUT.y == "Gout" ~ "CG",
GOUT.x == "Control" & GOUT.y == "Control" ~ "CC"))
# Extracting all IDs, including duplicated (1,034 total).
related_ids_all <- c(related$ID1, related$ID2)
# Isolating duplicated IDs.
related_ids_dup <- related_ids_all[duplicated(related_ids_all) | duplicated(related_ids_all, fromLast = TRUE)]
# Isolating non-duplicated IDs.
related_ids_notdup <- related_ids_all[!related_ids_all %in% related_ids_dup]
# Extracting 349 pairs of non-duplicated IDs and adding columns for filtering to exclude the first or second of the pair based on gout status.
related_pairs_unique <- related_pairs2 %>%
filter(ID1 %in% related_ids_notdup,
ID2 %in% related_ids_notdup) %>%
mutate(keep1 = Pair %in% c('GG', 'GC', 'CC'),
keep2 = !keep1)
# Making list of individuals for exclusion based on the above pairs (total 349).
remove1 <- c(related_pairs_unique %>% filter(keep2) %>% pull(ID1),
related_pairs_unique %>% filter(keep1) %>% pull(ID2))
# Finding pairs including duplicated individuals as the first ID (total 168, this is all remaining pairs).
related_pairs_multi <- related_pairs2 %>%
filter(!ID1 %in% related_pairs_unique$ID1)
# Filtering and adding column for removal of individuals that are controls in a pair of gout-controls (total 41 pairs).
related_pairs_multi_gc <- related_pairs_multi %>%
filter(Pair %in% c("GC", "CG")) %>%
mutate(remove_id = case_when(Pair == "GC" ~ ID2,
Pair == "CG" ~ ID1))
# Making second list of individuals for exclusion (37 controls).
remove2 <- related_pairs_multi_gc$remove_id %>%
unique()
# Finding pairs with controls in both columns, excluding those with a control that was already removed.
related_pairs_multi_cc <- related_pairs_multi %>%
filter(Pair == "CC",
!(ID1 %in% remove2 | ID2 %in% remove2))
# Making third list of individuals for exclusion (32 controls).
remove3 <- related_pairs_multi_cc$ID1 %>%
unique()
# Isolating all remaining pairs (all gout/gout).
related_pairs_multi_gg <- related_pairs_multi %>%
filter(Pair == "GG")
# Making final list of 28 individuals for exclusion based on first column of each gout-gout pair.
remove4 <- related_pairs_multi_gg$ID1 %>%
unique()
# Combining all removal lists together.
remove_list <- c(remove1, remove2, remove3, remove4)
# Removing all 446 individuals that were either duplicates or closely related.
all_pheno_dedup <- all_pheno %>%
filter(!(Geno.SampleID %in% remove_list))
# Saving the deduplicated all_pheno file along with the UKBB phenotype file.
save(all_pheno_dedup, ukbb_pheno, file = path(scratch_path, "/Output/Phenotypes_dedup.RData"))
# Cleaning up.
rm(list = ls()[str_detect(ls(), "relat|remov")], all_pheno2, fam2, fam_clean)
Calculating PRS
The UKBB Gout PRS and Tin Urate PRS (including variants) were calculated for all European and Polynesian samples. Using the results of the UKBB GWAS (only using genotyped variants), 21 PRS metrics were generated, this includes the complete PRS, the 19 variants individually, and the PRS without either ABCG2 variant. For the Tin GWAS this number was 84 (full PRS, 19 variant PRS, and 82 unique variants, some of which overlap with the 19 of the gout GWAS).
Initially, the complete list of unique variants in both lead variant files were extracted from the CoreExome PLINK files.
# Loading lists of PRS variants.
load(here("Output/UKBB_Gene_OR.RData"))
load(here("Output/Tin_Gene_OR.RData"))
# Making list of variants and locations for extracting SNPs from the CoreExome PLINK file.
for_plink <- UKBB_Gene_OR %>%
select(CHR, BP, RSID) %>%
rbind(select(Tin_Gene_OR, CHR:RSID)) %>%
unique() %>%
mutate(BP2 = BP) %>%
select(CHR, BP, BP2, RSID) %>%
arrange(CHR, BP)
# Write out variants/locations file for PLINK filtering.
write_delim(for_plink, file = path(scratch_path, "Output/Temp/UKBB_SNPs_Plink.txt"), col_names = F)
# Cleaning up.
rm(for_plink)
# Setting directory as a variable.
PRS_SCRATCH=/Volumes/scratch/merrimanlab/Nick/PRS
# Extracting SNPs of interest from CoreExome PLINK files.
plink1.9b6.10 --bfile /Volumes/archive/merrimanlab/raid_backup/New_Zealand_Chip_data/CoreExome/QC_MergedBatches/Final_Data/CZ-MB1.2-QC1.10_CoreExome24-1.0-3_genotyped-QCd_rsIDconverted --extract range $PRS_SCRATCH/Output/Temp/UKBB_SNPs_Plink.txt --make-bed --out $PRS_SCRATCH/Output/Temp/SNPs
# Converting binary PLINK files to non-binary format.
plink1.9b6.10 --bfile $PRS_SCRATCH/Output/Temp/SNPs --recode --out $PRS_SCRATCH/Output/Temp/SNPs
NExt, the complete list of unique variants in both lead variant files were extracted from the UK Biobank BGEN files.
# Extracting SNPs from the UK Biobank and converting to merged PLINK file.
# Extracting unique chromosome/BP pairs of all variants in the two variant lists.
tmp <- UKBB_Gene_OR %>%
select(CHR, BP) %>%
rbind(select(Tin_Gene_OR, CHR, BP)) %>%
unique() %>%
arrange(CHR, BP)
# Adding 0 before chromosomes under 10 and making BGEN filtering format.
bgen_range1 <- tmp %>%
filter(CHR < 10) %>%
mutate(BGEN = paste0("0", CHR, ":", BP, "-", BP))
# Making BGEN filtering format for remaining chromosomes.
bgen_range2 <- tmp %>%
filter(CHR > 9) %>%
mutate(BGEN = paste0(CHR, ":", BP, "-", BP))
# Combining the two lists together.
bgen_range <- rbind(bgen_range1, bgen_range2) %>%
arrange(CHR, BP) %>%
select(BGEN)
# Writing out file for filtering UK Biobank variants.
write_delim(bgen_range, file = path(scratch_path, "Output/Temp/PRS_SNPs_BGEN.txt"), delim = "\n", col_names = F)
# Writing file containing the absolute path to find plink files for merging (to be used in bash script below).
write_delim(as_tibble(paste0(path(scratch_path, "Output/Temp"), "/chr", unique(tmp$CHR), "_PRS")), file = path(scratch_path, "Output/Temp/mergefile_prs.txt"), delim = "\n", col_names = F)
# Making file with list of unique chromosomes for bash script below.
write_delim(unique(tmp %>% select(CHR)), file = path(scratch_path, "Output/Temp/unique_chr_ukbb.txt"), col_names = F)
# Cleaning up.
rm(bgen_range1, bgen_range2, bgen_range, tmp)
# Setting directories as a variable.
PRS_SCRATCH=/Volumes/scratch/merrimanlab/Nick/PRS
PRS_DIR=/Volumes/archive/userdata/student_users/nicksumpter/Documents/PhD/PRS
UKBB_DIR=/Volumes/archive/merrimanlab_nobackup/ukbio/EGAD00010001474
# Extracting variants of interest from UK Biobank genotype files in parallel plus cleaning up the resulting VCF files.
parallel "bgenix -g {1}/ukb_imp_chr{3}_v3.bgen -vcf -incl-range {2}/Output/Temp/PRS_SNPs_BGEN.txt | bcftools reheader -h {1}/bgen_to_vcf/new_header.txt | bcftools annotate --rename-chrs {1}/bgen_to_vcf/rename_contigs.txt | bgzip -c > {2}/Output/Temp/chr{3}_forPRS.vcf.gz" ::: $UKBB_DIR ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/unique_chr_ukbb.txt
# Converting the VCF files made in the previous step into PLINK format.
parallel "plink1.9b4.9 --vcf {1}/Output/Temp/chr{2}_forPRS.vcf.gz --make-bed --out {1}/Output/Temp/chr{2}_PRS" ::: $PRS_SCRATCH :::: $PRS_SCRATCH/Output/Temp/unique_chr_ukbb.txt
# Merging the above PLINK files together based on the mergefile_prs.txt file that was made above.
plink1.9b6.10 --merge-list $PRS_SCRATCH/Output/Temp/mergefile_prs.txt --make-bed --out $PRS_SCRATCH/Output/Temp/merged_PRS
# Converting binary PLINK file into non-binary format.
plink1.9b6.10 --bfile $PRS_SCRATCH/Output/Temp/merged_PRS --recode --out $PRS_SCRATCH/Output/Temp/merged_PRS
Finally, the various PRS metrics were calculated in each cohort of interest.
# Reading in filtered Tin et al. summary statistics.
tin <- vroom(path(scratch_path, "/Output/Temp/tin_filtered.txt"),
delim = "\t",
col_names = T)
# Finding Tin variants that are also in the UKBB Gout GWAS 19 variant list. Also making risk allele column accurately reflect the correct allele.
tin2 <- tin %>%
filter(RSID %in% UKBB_Gene_OR$RSID) %>%
mutate(risk_allele = case_when(Effect > 0 ~ toupper(Allele1),
Effect < 0 ~ toupper(Allele2)))
# Testing that all variants are in the same effect direction, so I can just take the absolute value of the urate effect.
sum(tin2$risk_allele == UKBB_Gene_OR$Effect_Allele)
# Extracting 8 SNPs that are not in common between the two lists, 4 of which are the secondary hits at the same loci.
different <- UKBB_Gene_OR %>%
filter(!(RSID %in% Tin_Gene_OR$RSID)) %>%
select(CHR:BP, BP1:Locus_Name)
# Excluding the secondary hits at the same loci, leaving four variants.
different <- different %>%
filter(!(Locus_Name %in% c("SLC2A9", "WDR1", "ABCG2", "NRXN2")))
# Making filtered gout GWAS lead variant table based on the variants that either directly overlap or those that are in the "different" table. Also adding Tin_Effect column based on the absolute effect of the variant on urate. Also reverting OR column to the original OR. Finally extracting the log of the gout OR and finding a ratio of the effect sizes. This variant list can be used for direct comparisons of the Tin PRS weightings and Gout PRS weightings.
tmp_gene_or <- UKBB_Gene_OR %>%
mutate(Tin_Effect = abs(tin2$Effect)) %>%
filter(RSID %in% c(Tin_Gene_OR$RSID, different$RSID)) %>%
mutate(OR = case_when(is.na(OR_old) ~ OR,
TRUE ~ OR_old),
Gout_Effect = log(OR),
Effect_Ratio = Gout_Effect / Tin_Effect) %>%
select(CHR:BP, EAF, Locus_Name, Gout_Effect, Tin_Effect, Effect_Ratio)
# Identifying all 11 CHR/BP positions present in both PRS variant lists.
tmp <- UKBB_Gene_OR %>%
select(CHR, BP) %>%
rbind(select(Tin_Gene_OR, CHR, BP)) %>%
filter(duplicated(BP)) %>%
arrange(CHR, BP)
# Extracting these 11 variants then comparing the effect allele for gout with effect allele for urate to show they are the same.
tmp1 <- UKBB_Gene_OR %>%
filter(BP %in% tmp$BP) %>%
select(Effect_Allele) %>%
cbind(Tin_Gene_OR %>% filter(BP %in% tmp$BP) %>% select(Effect_Allele))
# Making combined variant list with all 90 unique lead variants for both GWAS (corresponding to the PLINK files).
Combined_Gene_OR <- UKBB_Gene_OR %>%
select(CHR, BP, RSID, Effect_Allele, Alternate_Allele) %>%
rbind(Tin_Gene_OR %>% select(CHR, BP, RSID, Effect_Allele, Alternate_Allele)) %>%
arrange(CHR, BP) %>%
unique()
# Reading in PLINK map file for the CoreExome PLINK files with all 90 variants from the combination of the Gout PRS and Tin PRS variant lists.
map <- read_delim(path(scratch_path, "/Output/Temp/SNPs.map"),
delim = "\t",
col_names = FALSE)
# Reading in corresponding PLINK ped file.
x <- read_delim(path(scratch_path, "/Output/Temp/SNPs.ped"),
delim = " ",
col_names = FALSE,
col_types = cols(.default = col_character()))
# Renaming columns of x based on map file.
colnames(x)[1:6] <- c("FID", "IID", "PID", "MID", "SEX", "AFF")
colnames(x)[seq(from = 7, to = ncol(x) - 1, by = 2)] <- str_c(map$X2, "_1")
colnames(x)[seq(from = 8, to = ncol(x), by = 2)] <- str_c(map$X2, "_2")
# Filtering x to only include individuals that are in the all_pheno_dedup table.
x <- x %>%
filter(IID %in% (all_pheno_dedup$Geno.SampleID))
# Setting the number of columns of x as a variable.
num_cols <- ncol(x)
# Converting character genotypes into numeric genotypes based on risk allele = 1.
# For each variant in Combined_Gene_OR (90 total).
for(i in 1:nrow(Combined_Gene_OR)){
# Modifying the column at 2 * i + 5 (i.e. the column corresponding to the first allele of that variant) to first replace any 0 with NA, then replace the Effect Allele with a 1 and the Alternate allele with a 0. Finally, converting the column to a numeric column.
x[[2 * i + 5]] <- x[[2 * i + 5]] %>%
str_replace("0", NA_character_) %>%
str_replace(Combined_Gene_OR[[i, "Effect_Allele"]], "1") %>%
str_replace(Combined_Gene_OR[[i, "Alternate_Allele"]], "0") %>%
as.numeric()
# Doing the same for the column at 2 * i + 6 (i.e. the column corresponding to the second allele of that variant).
x[[2 * i + 6]] <- x[[2 * i + 6]] %>%
str_replace("0", NA_character_) %>%
str_replace(Combined_Gene_OR[[i, "Effect_Allele"]], "1") %>%
str_replace(Combined_Gene_OR[[i, "Alternate_Allele"]], "0") %>%
as.numeric()
# Making a temporary column that is the sum of both numeric alleles.
x <- x %>%
mutate("TEMP" = (x[[2 * i + 5]] + x[[2 * i + 6]]))
# Renaming the column names such that the temporary column is renamed to the corresponding RSID.
colnames(x) <- c(colnames(x[1:((num_cols - 1) + i)]), Combined_Gene_OR[[i, "RSID"]])
}
# Subsetting the columns to only include the IID column and the newly defined numeric genotype columns.
x <- x %>%
select(2, (num_cols + 1):ncol(x))
# Saving this table for individual SNP analysis.
x1 <- x
# Subsetting x further to only include variants in the gout PRS.
x <- x %>%
select(1, UKBB_Gene_OR$RSID)
# For each of the 19 variants in the gout PRS, weighting the numeric genotype by the log of the OR for that variant.
for(i in 1:nrow(UKBB_Gene_OR)) {
x[i + 1] <- x[[i + 1]] * log(UKBB_Gene_OR[[i, "OR"]])
}
# Summing across all weighted genotypes to produce the PRS variable.
x$PRS <- rowSums(x[2:(ncol(x))])
# Producing the PRS without either ABCG2 variant, then extracting the IID along with two PRS's.
x <- x %>%
mutate(PRS_noABCG2 = PRS - rs2231142 - rs10011796) %>%
select(IID, PRS, PRS_noABCG2)
# Adding the two PRS variants to the all_pheno_dedup table.
all_pheno_prs <- all_pheno_dedup %>%
left_join(x, by = c("Geno.SampleID" = "IID"))
# Subsetting x1 to only include variants in the urate PRS.
x <- x1 %>%
select(1, Tin_Gene_OR$RSID)
# For each of the 82 variants in the urate PRS, weighting the numeric genotype by the beta for that variant.
for(i in 1:nrow(Tin_Gene_OR)) {
x[i + 1] <- x[[i + 1]] * Tin_Gene_OR[[i, "Beta"]]
}
# Summing across all weighted genotypes to produce the PRS variable.
x$Urate_PRS <- rowSums(x[2:(ncol(x))])
# Extracting the IID and PRS columns.
x <- x %>%
select(IID, Urate_PRS)
# Adding the urate PRS column.
all_pheno_prs <- all_pheno_prs %>%
left_join(x, by = c("Geno.SampleID" = "IID"))
# Extracting just the 15 variants that were in common between the two PRS's.
x <- x1 %>%
select(1, tmp_gene_or$RSID)
# Weighting these variants by the gout effect size.
for(i in 1:nrow(tmp_gene_or)) {
x[i + 1] <- x[[i + 1]] * tmp_gene_or[[i, "Gout_Effect"]]
}
# Making second gout PRS based on these 15 variants.
x$PRS2 <- rowSums(x[2:(ncol(x))])
# Extracting the IID and new PRS column.
x <- x %>%
select(IID, PRS2)
# Adding the new PRS to the full phenotype file.
all_pheno_prs <- all_pheno_prs %>%
left_join(x, by = c("Geno.SampleID" = "IID"))
# Again, extracting just the 15 variants that were in common between the two PRS's.
x <- x1 %>% select(1, tmp_gene_or$RSID)
# Weighting these variants by the urate effect size.
for(i in 1:nrow(tmp_gene_or)) {
x[i + 1] <- x[[i + 1]] * tmp_gene_or[[i, "Tin_Effect"]]
}
# Making second urate PRS based on these 15 variants.
x$Urate_PRS2 <- rowSums(x[2:(ncol(x))])
# Extracting the IID and new urate PRS column.
x <- x %>%
select(IID, Urate_PRS2)
# Adding the new urate PRS to the full phenotype file.
all_pheno_prs <- all_pheno_prs %>%
left_join(x, by = c("Geno.SampleID" = "IID"))
# Again, extracting just the 15 variants that were in common between the two PRS's.
x <- x1 %>% select(1, tmp_gene_or$RSID)
# Making unweighted PRS based on their genotypes.
x$Unweighted_PRS <- rowSums(x[2:(ncol(x))])
# Extracting the IID and unweighted PRS column.
x <- x %>%
select(IID, Unweighted_PRS)
# Adding the unweighted PRS to the full phenotype file.
all_pheno_prs <- all_pheno_prs %>%
left_join(x, by = c("Geno.SampleID" = "IID"))
# Finally adding the numeric genotypes for the 19 gout PRS variants.
x <- x1 %>%
select(1, UKBB_Gene_OR$RSID)
# Adding the gout PRS variants to the full phenotype file.
all_pheno_prs <- all_pheno_prs %>%
left_join(x, by = c("Geno.SampleID" = "IID"))
# Now doing the same as above for the UK Biobank cohort.
# Reading in the PLINK map file and cleaning up the variant column.
map <- read_delim(path(scratch_path, "/Output/Temp/merged_PRS.map"),
delim = "\t",
col_names = FALSE) %>%
separate(X2, into = c("X2", NA), sep = ",")
# Reading in the PLINK ped file.
x <- read_delim(path(scratch_path, "/Output/Temp/merged_PRS.ped"),
delim = " ",
col_names = FALSE,
col_types = cols(.default = col_character()))
# Renaming columns of x based on map file.
colnames(x)[1:6] <- c("FID", "IID", "PID", "MID", "SEX", "AFF")
colnames(x)[seq(from = 7, to = ncol(x) - 1, by = 2)] <- str_c(map$X2, "_1")
colnames(x)[seq(from = 8, to = ncol(x), by = 2)] <- str_c(map$X2, "_2")
# Making list of IDs of interest.
ids <- ukbb_pheno$IID
# Filtering x to only include the ids of interest.
x <- x %>%
mutate(IID = as.numeric(IID)) %>%
filter(IID %in% ids)
# Setting the number of columns of x as a variable.
num_cols <- ncol(x)
# Converting character genotypes into numeric genotypes based on risk allele = 1.
# For each variant in Combined_Gene_OR (90 total).
for(i in 1:nrow(Combined_Gene_OR)){
# Modifying the column at 2 * i + 5 (i.e. the column corresponding to the first allele of that variant) to first replace any 0 with NA, then replace the Effect Allele with a 1 and the Alternate allele with a 0. Finally, converting the column to a numeric column.
x[[2 * i + 5]] <- x[[2 * i + 5]] %>%
str_replace("0", NA_character_) %>%
str_replace(Combined_Gene_OR[[i, "Effect_Allele"]], "1") %>%
str_replace(Combined_Gene_OR[[i, "Alternate_Allele"]], "0") %>%
as.numeric()
# Doing the same for the column at 2 * i + 6 (i.e. the column corresponding to the second allele of that variant).
x[[2 * i + 6]] <- x[[2 * i + 6]] %>%
str_replace("0", NA_character_) %>%
str_replace(Combined_Gene_OR[[i, "Effect_Allele"]], "1") %>%
str_replace(Combined_Gene_OR[[i, "Alternate_Allele"]], "0") %>%
as.numeric()
# Making a temporary column that is the sum of both numeric alleles.
x <- x %>%
mutate("TEMP" = (x[[2 * i + 5]] + x[[2 * i + 6]]))
# Renaming the column names such that the temporary column is renamed to the corresponding RSID.
colnames(x) <- c(colnames(x[1:((num_cols - 1) + i)]), Combined_Gene_OR[[i, "RSID"]])
}
# Subsetting the columns to only include the IID column and the newly defined numeric genotype columns.
x <- x %>%
select(2, (num_cols + 1):ncol(x))
# Saving this table for individual SNP analysis.
x1 <- x
# Subsetting x further to only include variants in the gout PRS.
x <- x %>%
select(1, UKBB_Gene_OR$RSID)
# For each of the 19 variants in the gout PRS, weighting the numeric genotype by the log of the OR for that variant.
for(i in 1:nrow(UKBB_Gene_OR)) {
x[i + 1] <- x[[i + 1]] * log(UKBB_Gene_OR[[i, "OR"]])
}
# Summing across all weighted genotypes to produce the PRS variable.
x$PRS <- rowSums(x[2:(ncol(x))])
# Producing the PRS without either ABCG2 variant, then extracting the IID along with two PRS's.
x <- x %>%
mutate(PRS_noABCG2 = PRS - rs2231142 - rs10011796) %>%
select(IID, PRS, PRS_noABCG2) %>%
mutate(IID = factor(IID))
# Adding the two PRS variants to the ukbb_pheno table.
all_pheno_prs2 <- ukbb_pheno %>%
mutate(IID = factor(IID)) %>%
left_join(x, by = "IID")
# Subsetting x1 to only include variants in the urate PRS.
x <- x1 %>%
select(1, Tin_Gene_OR$RSID)
# For each of the 82 variants in the urate PRS, weighting the numeric genotype by the beta for that variant.
for(i in 1:nrow(Tin_Gene_OR)) {
x[i + 1] <- x[[i + 1]] * Tin_Gene_OR[[i, "Beta"]]
}
# Summing across all weighted genotypes to produce the PRS variable.
x$Urate_PRS <- rowSums(x[2:(ncol(x))])
# Extracting the IID and PRS columns.
x <- x %>%
select(IID, Urate_PRS) %>%
mutate(IID = factor(IID))
# Adding the urate PRS to the phenotype table.
all_pheno_prs2 <- all_pheno_prs2 %>%
left_join(x, by = c("IID"))
# Extracting just the 15 variants that were in common between the two PRS's.
x <- x1 %>%
select(1, tmp_gene_or$RSID)
# Weighting these variants by the gout effect size.
for(i in 1:nrow(tmp_gene_or)) {
x[i + 1] <- x[[i + 1]] * tmp_gene_or[[i, "Gout_Effect"]]
}
# Making second gout PRS based on these 15 variants.
x$PRS2 <- rowSums(x[2:(ncol(x))])
# Extracting the IID and new PRS column.
x <- x %>%
select(IID, PRS2) %>%
mutate(IID = factor(IID))
# Adding the new gout PRS to the phenotype table.
all_pheno_prs2 <- all_pheno_prs2 %>%
left_join(x, by = c("IID"))
# Again, extracting just the 15 variants that were in common between the two PRS's.
x <- x1 %>% select(1, tmp_gene_or$RSID)
# Weighting these variants by the urate effect size.
for(i in 1:nrow(tmp_gene_or)) {
x[i + 1] <- x[[i + 1]] * tmp_gene_or[[i, "Tin_Effect"]]
}
# Making second urate PRS based on these 15 variants.
x$Urate_PRS2 <- rowSums(x[2:(ncol(x))])
# Extracting the IID and new urate PRS column.
x <- x %>%
select(IID, Urate_PRS2) %>%
mutate(IID = factor(IID))
# Adding the new urate PRS to the phenotype table.
all_pheno_prs2 <- all_pheno_prs2 %>%
left_join(x, by = c("IID"))
# Again, extracting just the 15 variants that were in common between the two PRS's.
x <- x1 %>% select(1, tmp_gene_or$RSID)
# Making unweighted PRS based on their genotypes.
x$Unweighted_PRS <- rowSums(x[2:(ncol(x))])
# Extracting the IID and unweighted PRS column.
x <- x %>%
select(IID, Unweighted_PRS) %>%
mutate(IID = factor(IID))
# Adding the unweighted PRS to the phenotype table.
all_pheno_prs2 <- all_pheno_prs2 %>%
left_join(x, by = c("IID"))
# Finally adding the numeric genotypes for the 19 gout PRS variants.
x <- x1 %>%
select(1, UKBB_Gene_OR$RSID) %>%
mutate(IID = factor(IID))
# Adding the gout PRS variants to the full phenotype file.
all_pheno_prs2 <- all_pheno_prs2 %>%
left_join(x, by = c("IID"))
# Combining the two phenotype files together.
all_pheno_prs <- all_pheno_prs %>%
mutate(GOUT = as.logical(case_when(GOUT == "Gout" ~ TRUE,
GOUT == "Control" ~ FALSE))) %>%
full_join(all_pheno_prs2)
# Saving the combined phenotype file with the addition of the various PRS metrics.
save(all_pheno_prs, file = path(scratch_path, "Output/all_pheno_prs.RData"))
# Cleaning up.
rm(all_pheno_prs2, map, x, x1, i, ids, num_cols, tmp, ukbb_pheno, tmp1, all_pheno_dedup, different, tin, tin2, all_pheno, Combined_Gene_OR, tmp_gene_or)
The following code is for testing our definition of tophaceous disease, at the request of a co-author. This has not yet been completed (as of 19th August).
# Investigating tophus definitions.
# Loading CoreExome QC 1-10 phenotype file into R (this was made by Tanya).
CoreExPheno <- read_delim("/Volumes/archive/merrimanlab/raid_backup/New_Zealand_Chip_data/CoreExome/QC_MergedBatches/Final_Data/CZ-MB1.2-QC1.10_MergedPhenotypes_20082020.txt", delim = "\t") %>%
mutate(across(where(is_character), factor))
# Loading IDs of individuals genotyped on the CoreExome chip.
All_CoreEx_ID <- read_delim("/Volumes/archive/merrimanlab/raid_backup/New_Zealand_Chip_data/CoreExome/QC_MergedBatches/Final_Data/CZ-MB1.2-QC1.10_CoreExome24-1.0-3_genotyped-QCd_rsIDconverted.fam", delim = " ", col_names = F)
# Extracting information on the genotyped individuals from European cohorts of interest.
CoreExPheno_Euro <- CoreExPheno %>%
filter(Geno.BroadAncestry == "European",
Geno.SampleID %in% All_CoreEx_ID$X2,
General.Use != "No",
!(Pheno.Study %in% c("Auckland Controls", "Australian Controls", "ESR", "Rheumatoid Arthritis")))
# Extracting information on the genotyped individuals from Polynesian cohorts of interest.
CoreExPheno_Poly <- CoreExPheno %>%
filter(Geno.BroadAncestry == "Oceanian",
Geno.SampleID %in% All_CoreEx_ID$X2,
General.Use != "No",
!(Pheno.Study %in% c("ESR", "Pacific Trust")))
# Combining the above European and Polynesian cohort information together into one phenotype file, excluding individuals with unknown genetic sex, and excluding those without gout phenotype information. Columns of interest were extracted at this step.
CoreExPheno_Final <- full_join(CoreExPheno_Euro, CoreExPheno_Poly) %>%
filter(Geno.GeneticSex != "Unknown",
!is.na(Pheno.GoutSummary)) %>%
mutate(Pheno.GoutSummary = factor(case_when(Pheno.GoutSummary == "Gout" ~ "Gout",
Pheno.GoutSummary %in% c("Control", "HyperU") ~ "Control")),
across(where(is.factor), factor)) %>%
select(Pheno.SampleID:Pheno.UrateTherapy, GenStudio.ChipType, GenStudio.CallRate:Notes)
# Cleaning up.
rm(CoreExPheno, CoreExPheno_Euro, CoreExPheno_Poly, All_CoreEx_ID)
# Making functions for converting Boolean variables into TRUE/FALSE from 2/1 or 1/0.
logicfactor <- function(x) {
as.logical(factor(x, levels = c(1, 2), labels = c("FALSE", "TRUE")))
}
logicfactor2 <- function(x) {
as.logical(factor(x, levels = c(0, 1), labels = c("FALSE", "TRUE")))
}
# Gout in Aotearoa.
# Making temporary phenotype file for the Gout in Aotearoa cohort based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study == "Gout in Aotearoa")
# Extracting IDs from the SNPmax Gout in Aotearoa phenotype file based on the CoreExome file. Selecting columns of interest. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
aotearoa_pheno <- read_delim(here("Data/Phenotypes/NZPheno.txt"), delim = "\t", guess_max = 5000) %>%
filter(SUBJECT %in% tmp$Pheno.SampleID) %>%
select(SUBJECT, DATEARR, DOB, AGECOL, DIABETES, FAMGOUT, FAMGOUT3:HIBP, HIBPTREAT:FRUSEMIDE, BUMETANIDE, THIAZIDEDIURETIC:BENDROFLUAZIDE, HCTZ, METOLAZONE, CHLORHALIDONE, INDAPAMIDE, OTHDIURETIC, SPIRONOLACTONE, AMILORIDE, ACETAZOLAMIDE, DIURETICCOMMENT:DIURRECRUIT, LIPIDS, LIPIDLOWER:BILEACIDSEQ, HEART:STROKE, KIDNEY:HEALTHOTH, SUGDRINK, SMOKER:OTHALCO, WEIGHT:HEIGHT, BMI:BMICALC, MRURATE:MRCREATDATE, GOUTCRITERIAB, SUSTOPHUS:DIURGOUT, ALLOPCURRENT, PROBENCURRENT, BENZBROCURRENT, FEBUXCURRENT, OTHULTCURRENT, CURULTCOMMENT:ALLOPINTOLERANCE, ALLOPSIDE, URATEDOX:HIGHESTSUDATE, CHOLES:TRIGLY, SCREAT:SURICACID, URATE1MONTH, RELATEDFILTER:RELATED) %>%
left_join(tmp, by = c("SUBJECT" = "Pheno.SampleID")) %>%
rename(IID = SUBJECT,
GOUT = Pheno.GoutSummary,
SEX = Geno.GeneticSex) %>%
mutate(across(where(is_character), factor),
across(c(FAMGOUT, FAMGOUT3, HIBP, DIURETIC:ACETAZOLAMIDE, LIPIDS:KIDNEY, FATTYLIVER, GOUTCRITERIAB:SUSTOPHUS, TOPHUS, ALLOPCURRENT:OTHULTCURRENT, ALLOPINTOLERANCE),
logicfactor),
AGE1ATK = case_when(!is.na(AGE1ATK) ~ AGE1ATK,
TRUE ~ AGECOL - DURATION),
DURATION = AGECOL - AGE1ATK + 1,
NUMATK = NUMATK,
TOPHIGOUT = GOUTCRITERIAB | SUSTOPHUS | TOPHUS,
URATE = case_when(is.na(SURICACID) ~ rowMeans(across(c(MRURATE, URATEDOX, PREULTURATE, HIGHESTSU, URATE1MONTH)), na.rm = TRUE) * 1000 / 59.48,
TRUE ~ SURICACID * 1000 / 59.48),
ULT = ALLOPCURRENT | PROBENCURRENT | BENZBROCURRENT | FEBUXCURRENT | OTHULTCURRENT) %>%
select(IID, GOUT, AGECOL, SEX, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# AGRIA.
# Making temporary phenotype file for the AGRIA cohort based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study == "AGRIA")
# Extracting IDs from the SNPmax AGRIA phenotype file based on the CoreExome file. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
agria_pheno <- read_delim(here("Data/Phenotypes/AGRIAPheno.txt"), delim = "\t") %>%
filter(PATIENT %in% tmp$Pheno.SampleID) %>%
left_join(tmp, by = c("PATIENT" = "Pheno.SampleID")) %>%
rename(IID = PATIENT,
GOUT = Pheno.GoutSummary) %>%
mutate(across(where(is_character), factor),
across(c(DIABETES:KIDNEY, ALLOP, PROBEN, STEROID, ANTIINFLAM, COLCHI, GPGOUT:SUSTOPHUS, FAMGOUT, FAMGOUT3, FOOD, FULLCAUGTAFF:TOPHIGOUT, URATELOWERING),
logicfactor),
TOPHUS = case_when(TOPHUS == 2 ~ TRUE,
TOPHUS == 1 ~ FALSE),
SSBCODE = factor(SSBCODE,
levels = 0:5,
labels = c("0/day", "0.1 - 0.99", "1.0 - 1.99", "2.0 - 2.99", "3.0 - 3.99", "4.0 +")),
FRUITCODE = factor(FRUITCODE,
levels = 0:5,
labels = c("0/day", "0.1 - 0.99", "1.0 - 1.99", "2.0 - 2.99", "3.0 - 3.99", "4.0 +")),
DIURETICSUMMARY = factor(DIURETICSUMMARY,
levels = 1:3,
labels = c("Not taking diuretics", "Taking diuretics", "Maybe taking diuretics")),
SEX = Geno.GeneticSex,
AGESERUM = round(as.duration(interval(DOB, SERUMDATE)) / as.duration(years(1)),
digits = 0),
AGESCL = round(as.duration(interval(DOB, SCLDATE)) / as.duration(years(1)),
digits = 0),
AGE1ATK = case_when(is.na(AGEGOUTDOX) ~ round(as.duration(interval(DOB, GOUTDOXDATE)) / as.duration(years(1)),
digits = 0),
TRUE ~ AGEGOUTDOX),
DURATION = AGECOL - AGE1ATK + 1,
TOPHIGOUT = case_when(COMMENT %in% c("No information, neither tophaceous or aspirate proven, Deceased",
"No information, neither tophaceous or aspirate proven",
"Gout, no tophi",
"No information, neither tophaceous or aspirate proven, lymphoma") ~ FALSE,
TRUE ~ TOPHUS | GOUTCRITERIAB | SUSTOPHUS | COMMENT %in% c("Tophaceous",
"Urate crystals present, tophaceous",
"Aspirate proven, tophacous",
"allopurinol intolerant, febuxostat intolerant, taking benzobromarone. Urate crystals present, tophacious",
"Tophaceous gout",
"Polyarticular tophaceous gout",
"Chronic tophaceous gout")),
EROSIONS = NA,
NUMATK = NA,
URATE1 = round(URATE * 1000 / 59.48, digits = 1),
URATEAGE1 = AGESERUM,
URATE2 = round(SURICACID_SCL * 1000 / 59.48, digits = 1),
URATEAGE2 = AGESCL,
URATE = case_when(!is.na(URATE1) ~ URATE1,
TRUE ~ URATE2),
ULT = case_when(is.na(URATE1) & !is.na(URATE2) ~ NA,
TRUE ~ ALLOP | PROBEN | COMMENT %in% c("allopurinol intolerant, febuxostat intolerant, taking benzobromarone. Urate crystals present, tophacious",
"Allopurinol hypersensitivity, Cholchicine induced diarrhoea, Febuxostat 40mg/day",
"febuxostat 40mg/day; liver toxicity with allopurinol")),
PROPHY = STEROID | ANTIINFLAM | COLCHI,
HYPERTENSION = case_when(!is.na(HIBP) ~ HIBP,
TRUE ~ DIURETICINDUCED == "Yes" | DIURETICSUMMARY == "Maybe taking diuretics"),
TRIGLY = TRIGLY_SCL * 88.57,
CHOLES = CHOLES_SCL * 38.67,
STROKE = NA,
HDL = HDL_SCL * 38.67,
CREAT = CREAT / 88.42,
SCREAT = SCREAT / 88.42,
CREAT2 = rowMeans(across(c(CREAT, SCREAT)), na.rm = TRUE),
EGFR = case_when(SEX == "Male" ~ 175 * (CREAT2 ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (CREAT2 ^ -1.154) * (AGECOL ^ -0.203) * 0.742),
KIDNEY = KIDNEY | !is.na(KIDNEY2) | EGFR < 60,
TOTALALC = rowSums(across(c(BEER, WINE, SPIRITS)), na.rm = TRUE),
CURSMOKE = NA,
FAMGOUT = FAMGOUT | FAMGOUT3,
FAMGOUTNUM = as.numeric(FAMGOUT4)) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Diabetes Mellitus (DM).
# Making temporary phenotype file for the DM cohort based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study == "Diabetes Mellitus")
# Extracting IDs from the SNPmax DM phenotype file based on the CoreExome file. Selecting columns of interest. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
dm_pheno <- read_delim(here("Data/Phenotypes/DMPheno.txt"), delim = "\t") %>%
filter(PATIENT %in% tmp$Pheno.SampleID) %>%
select(PATIENT, DOB, DATECOL, AGECOL, DIABETES:DIABETESTREAT, FAMGOUT:HIBPTREAT, LIPIDS, HEART:STROKE, KIDNEY:KIDNEY2, SUGDRINK, SMOKER:OTHALCO, WEIGHT, HEIGHT, BMI, URATE:CREAT, DIURETIC:OTHDIURETIC, LIPIDLOWER:BILEACIDSEQ, COMMENT, GOUTCRITERIAB, SUSTOPHUS:OTHDRUG, URATEDOX:DATEDOX, DIABETESAFFSTAT, KIDNEYTRANSPLANT, RENALDISEASE, FASTING:TRIGLY, SURICACID:EGFR) %>%
left_join(tmp, by = c("PATIENT" = "Pheno.SampleID")) %>%
rename(IID = PATIENT,
GOUT = Pheno.GoutSummary,
SEX = Geno.GeneticSex) %>%
mutate(across(where(is_character), factor),
across(c(DIABETES, FAMGOUT, FAMGOUT3, HIBP, LIPIDS, HEART:STROKE, KIDNEY, DIURETIC:OTHDIURETIC, LIPIDLOWER:BILEACIDSEQ, GOUTCRITERIAB, SUSTOPHUS, TOPHUS, ALLOP:COLCHI, DIABETESAFFSTAT, KIDNEYTRANSPLANT, RENALDISEASE), logicfactor),
DURATION = AGECOL - AGE1ATK + 1,
TOPHIGOUT = TOPHUS | GOUTCRITERIAB | SUSTOPHUS,
EROSIONS = NA,
URATE = case_when(!is.na(SURICACID) ~ SURICACID * 1000 / 59.48,
!is.na(URATE) ~ URATE * 1000 / 59.48,
TRUE ~ URATEDOX * 1000 / 59.48),
ULT = ALLOP | PROBEN,
PROPHY = STEROID | ANTIINFLAM | COLCHI | OTHDRUG != "no",
HEIGHT = HEIGHT / 100,
BMI = case_when(!is.na(BMI) ~ BMI,
TRUE ~ WEIGHT / (HEIGHT * HEIGHT)),
HYPERTENSION = HIBP | !is.na(HIBPTREAT) | DIURETIC | LOOPDIURETIC | THIAZIDEDIURETIC | OTHDIURETIC | DIURGOUT,
DIABETES = DIABETES | !is.na(DIABETESTREAT) | DIABETESAFFSTAT,
HEART = HEART | ANGINA | HEARTFAILURE | HEARTSURGERY | HEARTATTACK,
CREAT = CREAT / 88.42,
SCREAT = SCREAT / 88.42,
CREAT2 = rowMeans(across(c(CREAT, SCREAT)), na.rm = T),
EGFR = case_when(SEX == "Male" ~ 175 * (CREAT2 ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (CREAT2 ^ -1.154) * (AGECOL ^ -0.203) * 0.742,
TRUE ~ EGFR),
KIDNEY = KIDNEY | !is.na(KIDNEY2) | EGFR < 60 | KIDNEYTRANSPLANT | RENALDISEASE,
LIPIDS = LIPIDS | LIPIDLOWER | STATIN | FIBRATES | EZETIMIBE | NICOTINICACID | BILEACIDSEQ,
STROKE = STROKE,
TOTALALC = rowSums(across(c(BEER, WINE, SPIRITS)), na.rm = TRUE),
SUGDRINK = SUGDRINK,
CURSMOKE = SMOKER == 2,
FAMGOUT = FAMGOUT | FAMGOUT3,
FAMGOUTNUM = FAMGOUT4) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# LPA.
# Making temporary phenotype file for the LPA cohort based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study == "LPA")
# Extracting IDs from the SNPmax LPA phenotype file based on the CoreExome file. Selecting columns of interest. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
lpa_pheno <- read_delim(here("Data/Phenotypes/LPAPheno.txt"), delim = "\t") %>%
filter(SUBJECT %in% tmp$Pheno.SampleID) %>%
select(SUBJECT:AGE, SMOKING, SMOKEHISTORY, SUGARDRINKS:DIABETESTYPE, MAINHYPERTENSION:DYSLIPIDCOMMENT, MAINSTROKE:MAINSTROKECOM, BMHEIGHT:BMWEIGHT, SERUMCREATININE:SERUMURATE, TOTALCHOLESTEROL, TRIGLYCERIDES) %>%
left_join(tmp, by = c("SUBJECT" = "Pheno.SampleID")) %>%
rename(IID = SUBJECT,
GOUT = Pheno.GoutSummary,
SEX = Geno.GeneticSex,
AGECOL = AGE) %>%
mutate(across(where(is_character), factor),
across(c(SMOKING:SMOKEHISTORY, MAINDIABETES, MAINHYPERTENSION, DYSLIPIDEMIA),
logicfactor),
AGE1ATK = NA,
DURATION = NA,
NUMATK = NA,
TOPHIGOUT = NA,
EROSIONS = NA,
URATE = SERUMURATE,
ULT = NA,
PROPHY = NA,
HEIGHT = BMHEIGHT / 100,
BMI = BMWEIGHT / (HEIGHT * HEIGHT),
HYPERTENSION = MAINHYPERTENSION,
DIABETES = DIABETESTYPE == 2,
HEART = NA,
CREAT = SERUMCREATININE / 88.42,
EGFR = case_when(SEX == "Male" ~ 175 * (SERUMCREATININE ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (SERUMCREATININE ^ -1.154) * (AGECOL ^ -0.203) * 0.742),
KIDNEY = EGFR < 60,
LIPIDS = DYSLIPIDEMIA,
STROKE = MAINSTROKE,
TOTALALC = NA,
SUGDRINK = SUGARDRINKS,
CURSMOKE = SMOKING,
FAMGOUT = NA,
FAMGOUTNUM = NA) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Ngati Porou (NPH).
# Making temporary phenotype file for the NPH cohort based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study == "Ngati Porou")
# Extracting IDs from the SNPmax NPH phenotype file based on the CoreExome file. Selecting columns of interest. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
nph_pheno <- read_delim(here("Data/Phenotypes/NPHPheno.txt"), delim = "\t") %>%
filter(PATIENT %in% tmp$Pheno.SampleID) %>%
select(PATIENT, DOB, CONSENT, DATEARR, AGECOL, DIABETES, FAMGOUT:HIBP, LIPIDS, LIPIDLOWER:STROKE, KIDNEY, SUGDRINK, SMOKER:SPIRITS, WEIGHT:HEIGHT, BMI, URATE:CREATDATE, DIURETICCURRENT:FRUSEMIDE, BUMETANIDE, BENDROFLUAZIDE, HCTZ, METOLAZONE, CHLORHALIDONE, SPIRONOLACTONE, AMILORIDE, COMMENT, GOUTCRITERIAB, SUSTOPHUS, AGE1ATK:ALLOP, STEROID:OTHDRUG, URATEDOX:DATEDOX, RENALTRANSPLANT, DIABETESAFFSTAT, SURICACID:SCREAT, DIURETIC:OTHDIURETIC, STATIN:BILEACIDSEQ, URATELOWERING) %>%
left_join(tmp, by = c("PATIENT" = "Pheno.SampleID")) %>%
rename(IID = PATIENT,
GOUT = Pheno.GoutSummary,
SEX = Geno.GeneticSex) %>%
mutate(across(where(is_character), factor),
across(c(DIABETES, FAMGOUT, FAMGOUT3, HIBP:STROKE, KIDNEY, DIURETICCURRENT:AMILORIDE, GOUTCRITERIAB, SUSTOPHUS, TOPHUS, ALLOP:BENZOBROMARONE, RENALTRANSPLANT, DIABETESAFFSTAT, DIURETIC:URATELOWERING),
logicfactor),
AGE1ATK = case_when(!is.na(AGE1ATK) ~ AGE1ATK,
TRUE ~ AGECOL - DURATION),
DURATION = AGECOL - AGE1ATK + 1,
NUMATK = NUMATK,
TOPHIGOUT = GOUTCRITERIAB | SUSTOPHUS | TOPHUS,
EROSIONS = NA,
URATE = case_when(is.na(SURICACID) ~ rowMeans(across(c(URATE, URATEDOX)), na.rm = TRUE) * 1000 / 59.48,
TRUE ~ SURICACID * 1000 / 59.48),
ULT = ALLOP | PROBEN | BENZOBROMARONE | URATELOWERING,
PROPHY = STEROID | ANTIINFLAM | COLCHI,
HEIGHT = HEIGHT / 100,
BMI = WEIGHT / (HEIGHT * HEIGHT),
HYPERTENSION = HIBP | DIURETICCURRENT | FRUSEMIDE | BUMETANIDE | BENDROFLUAZIDE | HCTZ | METOLAZONE | CHLORHALIDONE | SPIRONOLACTONE | AMILORIDE | DIURGOUT %in% 2:4 | DIURETIC | LOOPDIURETIC | THIAZIDEDIURETIC | OTHDIURETIC,
DIABETES = DIABETES | DIABETESAFFSTAT,
HEART = HEART | ANGINA | HEARTFAILURE | HEARTSURGERY | HEARTATTACK,
CREAT = rowMeans(across(c(CREAT, SCREAT)), na.rm = TRUE) / 88.42,
EGFR = case_when(SEX == "Male" ~ 175 * (CREAT ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (CREAT ^ -1.154) * (AGECOL ^ -0.203) * 0.742),
KIDNEY = KIDNEY | EGFR < 60 | RENALTRANSPLANT,
LIPIDS = LIPIDS | LIPIDLOWER | STATIN | FIBRATES | EZETIMIBE | NICOTINICACID | BILEACIDSEQ,
STROKE = STROKE,
TOTALALC = rowSums(across(c(BEER, WINE, SPIRITS)), na.rm = TRUE),
SUGDRINK = SUGDRINK,
CURSMOKE = SMOKER == 2,
FAMGOUT = FAMGOUT,
FAMGOUTNUM = FAMGOUT4) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Renal Disease (RD).
# Making temporary phenotype file for the RD cohort based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study == "Renal Disease")
# Extracting IDs from the SNPmax RD phenotype file based on the CoreExome file. Selecting columns of interest. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
rd_pheno <- read_delim(here("Data/Phenotypes/RDPheno.txt"), delim = "\t") %>%
filter(PATIENT %in% tmp$Pheno.SampleID) %>%
select(PATIENT, DOB, CONSENTDATE, DATECOL, DATEARR, CKDV, RENALTRANSPLANT, DIABETES, FAMGOUT, HYPERTENSION, DYSLIPIDAEMIA, IHD, CVA, CHF, HEALTHOTH:WEIGHT, BMI, SMOKER, SUGDRINK, BEER:SPIRITS, COMMENT, TYPE2D, GOUTCRITERIAB, SUSTOPHUS, AGE1ATK:OTHDRUG, ESSENTIALHYPERT, SURICACID:SCREAT, RCOMMENTS) %>%
left_join(tmp, by = c("PATIENT" = "Pheno.SampleID")) %>%
rename(IID = PATIENT,
GOUT = Pheno.GoutSummary,
SEX = Geno.GeneticSex) %>%
mutate(across(where(is_character), factor),
across(c(RENALTRANSPLANT:CHF, TYPE2D:SUSTOPHUS, TOPHUS, ALLOPURINOL:RASBURICASE),
logicfactor),
AGECOL = AGECOL,
AGE1ATK = AGE1ATK,
DURATION = AGECOL - AGE1ATK + 1,
NUMATK = NA,
TOPHIGOUT = GOUTCRITERIAB | SUSTOPHUS | TOPHUS,
EROSIONS = NA,
URATE = case_when(is.na(SURICACID) ~ rowMeans(across(c(URATEFIRSTREC, URATEDOX, URATERECENT)), na.rm = TRUE) * 1000 / 59.48,
TRUE ~ SURICACID * 1000 / 59.48),
ULT = ALLOPURINOL | PROBEN | RASBURICASE,
PROPHY = STEROID | ANTIINFLAM | COLCHI,
HEIGHT = HEIGHT / 100,
BMI = WEIGHT / (HEIGHT * HEIGHT),
HYPERTENSION = HYPERTENSION | ESSENTIALHYPERT == 1 | DIURGOUT %in% 2:4,
DIABETES = DIABETES | TYPE2D,
HEART = IHD | CHF,
EGFR = case_when(SEX == "Male" ~ 175 * (SCREAT / 88.42) ^ -1.154 * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (SCREAT / 88.42) ^ -1.154 * (AGECOL ^ -0.203) * 0.742),
KIDNEY = CKDV == 1 | RENALTRANSPLANT | EGFR < 60,
LIPIDS = DYSLIPIDAEMIA,
STROKE = CVA,
TOTALALC = rowSums(across(c(BEER, WINE, SPIRITS)), na.rm = TRUE),
SUGDRINK = SUGDRINK,
CURSMOKE = SMOKER == 2,
FAMGOUT = FAMGOUT,
FAMGOUTNUM = NA) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# EuroGout.
# Making temporary phenotype file for the EuroGout cohort based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study == "EuroGout")
# Extracting IDs from the SNPmax EuroGout phenotype file based on the CoreExome file. Selecting columns of interest. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
eurogout_pheno <- read_delim(here("Data/Phenotypes/EuroGoutPheno.txt"), delim = "\t", guess_max = 5000) %>%
filter(SUBJECT %in% tmp$Pheno.SampleID) %>%
select(SUBJECT, RECRUITMENTDATE, DOB:WEIGHT, HEIGHT, BMI, TOPHUS:GOUTNOTES, ACRB, ACRC8, RENALDISEASE, T2DIABETES:HEARTFAILURE, MEDICALCOMMENT, URATETHERAPY:ALLOPURINOL, CHOLCHICINE:TLDIURETICS, ASPRIN, SUGARDRINK:FRUITJUICE, ALCOHOL:PREUTLKURATE, TCHOLESTEROL:TRIGLYCERIDES, EGFR) %>%
left_join(tmp, by = c("SUBJECT" = "Pheno.SampleID")) %>%
rename(IID = SUBJECT,
GOUT = Pheno.GoutSummary,
AGECOL = AGERECRUITMENT,
SEX = Geno.GeneticSex) %>%
mutate(across(where(is_character), factor),
across(c(TOPHUS, EROSIONS, ACRB, ACRC8, RENALDISEASE, T2DIABETES, HYPERTENSION, DYSLIPIDEMIA, STROKE:HEARTFAILURE, ALLOPURINOL:ASPRIN),
logicfactor2),
AGE1ATK = case_when(!is.na(AGEFIRSTATTK) ~ AGEFIRSTATTK,
TRUE ~ AGECOL - DURATIONGOUT),
DURATION = AGECOL - AGE1ATK + 1,
NUMATK = case_when(!is.na(NUMATTACKS) ~ NUMATTACKS,
NUMATTACKS_TXT == ">5" ~ 5,
NUMATTACKS_TXT == "1" ~ 1,
NUMATTACKS_TXT == "2" ~ 2,
NUMATTACKS_TXT == "3" ~ 3,
NUMATTACKS_TXT %in% c("3 to 5", "3-5") ~ 4,
NUMATTACKS_TXT %in% c("reported 'continue' I think. I assume this means ongoing.", "reported 100.") ~ 52,
NUMATTACKS_TXT == "zehn" ~ 10),
TOPHIGOUT = TOPHUS | NUMTOPHI %in% 1:3 | ACRB | ACRC8,
URATE = case_when(is.na(SERUMURATE) ~ PREUTLKURATE * 1000 / 59.48,
TRUE ~ SERUMURATE * 1000 / 59.48),
ULT = GOUTNOTES == "Gout assumed, taking allopurinol" | (!is.na(URATETHERAPY) & !(URATETHERAPY %in% c("diet", "NIL", "no", "No uric acid lowering therapy", "none", "None", "NONE", "none listed", "Unclear"))) | ALLOPURINOL,
PROPHY = CHOLCHICINE | NSAIDS | ASPRIN,
HEIGHT = HEIGHT / 100,
BMI = case_when(!is.na(BMI) ~ BMI,
TRUE ~ WEIGHT / (HEIGHT * HEIGHT)),
HYPERTENSION = HYPERTENSION | !is.na(HYPERTENTREATM) | MEDICALCOMMENT == "Said no to hypertension but beside BP states is on losartan" | DIURETICS | TLDIURETICS,
DIABETES = T2DIABETES | !is.na(T2DTREATMENT),
HEART = MI | IHD | HEARTFAILURE | MEDICALCOMMENT %in% c("Cardiovascular disease", "Heart problems", "Heart problems. EGFR available", "Heart problems. EGFR available. EGFR available. EGFR<60"),
CREAT = SERUMCREATININE / 88.42,
EGFR = case_when(SEX == "Male" ~ 175 * (CREAT ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (CREAT ^ -1.154) * (AGECOL ^ -0.203) * 0.742,
TRUE ~ EGFR),
KIDNEY = RENALDISEASE | EGFR < 60,
LIPIDS = DYSLIPIDEMIA | !is.na(LIPIDTREATMENT),
STROKE = STROKE,
TOTALALC = ALCOHOL,
SUGDRINK = SUGARDRINK + FRUITJUICE,
CURSMOKE = SMOKER == 1,
FAMGOUT = FAMILYHISTORY == 1,
FAMGOUTNUM = NUMFAMILYGOUT) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Ardea - Ironwood (CLEAR 1, CLEAR 2, CRYSTAL, and LIGHT).
# Making temporary phenotype file for the Ironwood cohorts based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study %in% c("Ardea: CLEAR1", "Ardea: CLEAR2", "Ardea: CRYSTAL", "Ardea: LIGHT"))
# Extracting IDs from the SNPmax Ironwood phenotype file based on the CoreExome file. Selecting columns of interest. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
ironwood_pheno <- read_delim(here("Data/Phenotypes/ArdeaPheno.txt"), delim = "\t", guess_max = 5000) %>%
filter(SUBJID %in% tmp$Pheno.SampleID) %>%
select(SUBJID, AGE, BRTHDTC, BLWEIGHT, BLHEIGHT, BLBMI, TRT01AN, CONSDT, TRTSDT, ANGINA:HYPERTRIGLY, MI, STROKE, AGFIDDT:GFDUR, CRITBFL, PHNM8FL, ULTALLO:ULTOTH, PLACTOTSTDT:PLACTOTENDT, THIALKFL:PROPHTYPN, TOPHIFN:BLAREA, GFNUM:GFNUMGR, DATESCREENING:EGFRSCREENING, CHOLSCREENING, TRIGSCREENING, URATESCREENING, DATENEG7, URATENEG7, EGFRNEG7, DATEBASELINE, URATEBASELINE, EGFRBASELINE, DATEMONTH1, URATEMONTH1, DATEMONTH2, URATEMONTH2, DATEMONTH3, URATEMONTH3, DATEMONTH4, URATEMONTH4, DATEMONTH5, URATEMONTH5, DATEMONTH6, URATEMONTH6, DATEMONTH8, URATEMONTH8, DATEMONTH10, URATEMONTH10, DATEMONTH12, URATEMONTH12, DATEEARLYTERM, URATEEARLYTERM, DATEFOLLOWUP, URATEFOLLOWUP, CURSMOKE:ALCOHOL, TOPHIGOUT:GOUTNOTES) %>%
left_join(tmp, by = c("SUBJID" = "Pheno.SampleID")) %>%
mutate(across(where(is_character), factor),
across(c(ANGINA:STROKE, CRITBFL:ULTOTH, THIALKFL:PROPHYFL, TOPHIFN, CURSMOKE:ALCOHOL),
logicfactor2),
TRT01AN = factor(TRT01AN,
levels = 0:5,
labels = c("Screen Failure", "Group A (Placebo)", "Group B (Lesinurad 200 mg)", "Group C (Lesinurad 400 mg)", "Not Assigned", "Not Treated"))) %>%
rename(IID = SUBJID,
GOUT = Pheno.GoutSummary,
AGECOL = AGE) %>%
mutate(SEX = Geno.GeneticSex,
AGE1ATK = round(as.duration(interval(ymd(BRTHDTC, truncated = 2L), AGFIDDT)) / as.duration(years(1)),
digits = 0),
DURATION = AGECOL - AGE1ATK + 1,
TOPHIGOUT = TOPHIFN,
EROSIONS = NA,
NUMATK = GFNUM,
URATE = URATESCREENING,
ULT = ULTALLO | ULTPROB | ULTFEBU | ULTOTH | Pheno.Study %in% c("Ardea: CLEAR1", "Ardea: CLEAR2") | (Pheno.Study == "Ardea: CRYSTAL" & URATE < 8),
PROPHY = PROPHYFL,
BMI = BLBMI,
HEART = HEARTFAILURE | MI | ANGINA,
KIDNEY = EGFRSCREENING < 60,
LIPIDS = HYPERCHOLESTEROL | HYPERTRIGLY,
TOTALALC = NA,
SUGDRINK = NA,
FAMGOUT = NA,
FAMGOUTNUM = NA,
EGFR2 = case_when(SEX == "Male" ~ 175 * (SCRSCREENING ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (SCRSCREENING ^ -1.154) * (AGECOL ^ -0.203) * 0.742)) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Ardea - LASSO.
# Making temporary phenotype file for the Ironwood cohorts based on the CoreExome phenotype file.
tmp <- CoreExPheno_Final %>%
filter(Pheno.Study == "Ardea: 401")
# Reading in three LASSO cohort phenotype files.
lassopheno1 <- read_delim(here("Data/Phenotypes/ArdeaLassoPhenoFlare.txt"), delim = "\t", guess_max = 5000)
lassopheno2 <- read_delim(here("Data/Phenotypes/ArdeaLassoPhenoLabChem.txt"), delim = "\t", guess_max = 5000)
lassopheno3 <- read_delim(here("Data/Phenotypes/ArdeaLassoPhenoMain.txt"), delim = "\t", guess_max = 5000)
# Combining all three phenotype files together.
tmp1 <- full_join(lassopheno3, lassopheno2, by = "SUBJID")
lasso_pheno <- full_join(tmp1, lassopheno1, by = "SUBJID")
# Cleaning up.
rm(lassopheno1, lassopheno2, lassopheno3, tmp1)
# Extracting IDs from the SNPmax LASSO phenotype file based on the CoreExome file. Converting the DNAID into a character variable named IID. Selecting columns of interest. Adding CoreExome file columns. Renaming variables. Converting all character variables to factors. Converting boolean variables with the logicfactor function. Further cleaning up columns and producing new columns of interest. Finally selecting all columns of interest.
lasso_pheno <- lasso_pheno %>%
filter(DNAID %in% tmp$Pheno.SampleID) %>%
mutate(IID = as.character(DNAID)) %>%
select(IID, AGE, BRTHDTC, GFNUM:ULTOSCR, BLBMI:PROPHTYP, AGFIDDT:GFDUR, TOLOCL:GFDTDURL, ANGINA:RENALIMPAIR, MI:STROKE, SCREENGFSTDT, SCREENGFENDT, SCREENGFOUT, SCREENGFSEV, SCREENPAIN, SCREENGFDUR:SCREENPAIN2, SCREENGFSTRSTP, SCREENLBDT, SCREENALT, SCREENCREAT, SCREENGGT, SCREENURATE, BASELINELBDT, BASELINEURATE, BASET1LBDT, BASET1URATE, BASET2LBDT, BASET2URATE, BASET3LBDT, BASET3URATE, MONTH1LBDT, MONTH1URATE, MONTH1T1LBDT, MONTH1T1URATE, MONTH1T2LBDT, MONTH1T2URATE, MONTH2LBDT, MONTH2URATE, MONTH2T1LBDT, MONTH2T1URATE, MONTH3LBDT, MONTH3URATE, MONTH3T1LBDT, MONTH3T1URATE, MONTH3T3LBDT, MONTH3T3URATE, MONTH4LBDT, MONTH4URATE, MONTH4T1LBDT, MONTH4T1URATE, MONTH5LBDT, MONTH5URATE, MONTH6LBDT, MONTH6URATE, UNSCHEDLBDT, UNSCHEDURATE, EARLYTERMLBDT, EARLYTERMURATE) %>%
left_join(tmp, by = c("IID" = "Pheno.SampleID")) %>%
rename(GOUT = Pheno.GoutSummary,
AGECOL = AGE,
NUMATK = GFNUM,
SEX = Geno.GeneticSex,
BMI = BLBMI) %>%
mutate(across(where(is_character), factor),
across(c(TOHANDFL:ULTOSCR, BLCDFL, ANGINA:RENALIMPAIR, MI:STROKE),
logicfactor2),
AGE1ATK = round(as.duration(interval(ymd(BRTHDTC, truncated = 2L), AGFIDDT)) / as.duration(years(1)),
digits = 0),
DURATION = AGECOL - AGE1ATK + 1,
TOPHIGOUT = BLTPHFN,
EROSIONS = NA,
URATE = SCREENURATE,
ULT = ALLOSCR | ULTOSCR | SCREENURATE < 8,
PROPHY = PROPHTYP %in% c("Both", "Colchicine", "NSAID"),
HEART = ANGINA | MI,
EGFR = case_when(SEX == "Male" ~ 175 * (SCREENCREAT ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (SCREENCREAT ^ -1.154) * (AGECOL ^ -0.203) * 0.742),
KIDNEY = RENALIMPAIR | EGFR < 60,
LIPIDS = HYPERCHOLESTEROL | HYPERTRIGLY,
TOTALALC = NA,
SUGDRINK = NA,
CURSMOKE = NA,
FAMGOUT = NA,
FAMGOUTNUM = NA) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Combining phenotypes for all cohorts together. Additionally, deduplicating the IID column so as to keep only unique individuals.
all_pheno <- rbind(agria_pheno, aotearoa_pheno, dm_pheno, eurogout_pheno, ironwood_pheno, lasso_pheno, lpa_pheno, nph_pheno, rd_pheno) %>%
mutate(Pheno.Study = factor(Pheno.Study)) %>%
arrange(IID) %>%
filter(!(duplicated(IID) | duplicated(IID, fromLast = TRUE)))
# Cleaning up.
rm(aotearoa_pheno, agria_pheno, dm_pheno, nph_pheno, rd_pheno, eurogout_pheno, ironwood_pheno, lasso_pheno, tmp, lpa_pheno, CoreExPheno_Final)
# Extracting phenotype information for the UK Biobank cohort.
# Saving location of the phenotype data as a variable.
ukbb_dir <- path("/Volumes/archive/merrimanlab/raid_backup/UKbiobank/")
# Saving location of the coding files for phenotypes.
codings_dir <- path(here("Data/codings"))
# Reading in list of withdrawn IDs.
withdrawn <- read_csv(path(ukbb_dir, "w12611_20210201.new.csv"), col_names = "id")
# Loading UKBB phenotype file from 2019.
load(path(ukbb_dir,'decrypted_files/ukb27189_27190_27191_27192_27193_27194_27195_27640_30070_31460_combined_withdrawn_ids_removed_10-07-2019.RData'))
# Renaming working dataset and removing withdrawn and missing IDs (leaves 488,247 individuals).
bd_refined <- refresh_ukbb_data %>%
filter(!is.na(eid) & !(eid %in% withdrawn$id)) # Total 488,247
# Reading coding document for self-reported diseases.
datacoding6 <- read_delim(path(codings_dir, "coding6.tsv"), delim = "\t")
# Reading coding document for self-reported medication.
datacoding4 <- read_delim(path(codings_dir, "coding4.tsv"), delim = "\t")
# Reading Nicola Dalbeth annotated datacoding4 file.
nd_drugs <- read_csv(path(ukbb_dir, "UKBio_drugs_ND_formatted.csv"),
col_names = TRUE,
col_types = cols(.default = "c"))
# Reading coding document for ICD10 (hospital diagnoses).
datacoding19 <- read_delim(path(codings_dir, "coding19.tsv"), delim = "\t")
# Extracting columns of interest.
test <- refresh_ukbb_data %>%
select(eid, body_mass_ind_bmi_f21001_0_0, alcohol_intake_frequency_f1558_0_0, current_tobacco_smoking_f1239_0_0)
# Loading in medication phenotype data file.
load("/Volumes/userdata/student_users/nicksumpter/Documents/PhD/Cluster/Temp/self_report_med.RData")
# Extracting ULT column from medication phenotype file.
tmp <- self_report_med %>%
mutate(IID = eid,
ULT = allopurinol | sulphinpyrazone | probenecid) %>%
select(IID, ULT)
# Loading cleaned up UK Biobank phenotype file.
load("/Volumes/userdata/student_users/nicksumpter/Documents/PhD/Cluster/Temp/final_data.RData")
# Combining all UK Biobank phenotypes of interest into a single file.
ukbb_pheno <- final_data %>%
mutate(eid = as.numeric(eid)) %>%
left_join(test, by = "eid") %>%
rename(IID = eid,
GOUT = gout,
AGECOL = age,
SEX = sex,
URATE = urate) %>%
mutate(Geno.PCVector1 = NA,
Geno.PCVector2 = NA,
Geno.PCVector3 = NA,
Geno.PCVector4 = NA,
Geno.PCVector5 = NA,
Geno.PCVector6 = NA,
Geno.PCVector7 = NA,
Geno.PCVector8 = NA,
Geno.PCVector9 = NA,
Geno.PCVector10 = NA,
Geno.PCVector1_Oc = NA,
Geno.PCVector2_Oc = NA,
Geno.PCVector3_Oc = NA,
Geno.PCVector4_Oc = NA,
Geno.PCVector5_Oc = NA,
Geno.PCVector6_Oc = NA,
Geno.PCVector7_Oc = NA,
Geno.PCVector8_Oc = NA,
Geno.PCVector9_Oc = NA,
Geno.PCVector10_Oc = NA,
AGE1ATK = NA,
DURATION = NA,
TOPHIGOUT = NA,
EROSIONS = NA,
NUMATK = NA,
PROPHY = NA,
Geno.SpecificAncestry = "European",
BMI = body_mass_ind_bmi_f21001_0_0,
HYPERTENSION = hypertension,
DIABETES = type2_diabetes,
HEART = coronary_heart_disease | heart_failure,
KIDNEY = ckd_stage3 | ckd_stage4 | end_stage_renal,
LIPIDS = dyslipidemia,
STROKE = cerebrovascular_disease,
TOTALALC = case_when(alcohol_intake_frequency_f1558_0_0 == "Daily or almost daily" ~ 14,
alcohol_intake_frequency_f1558_0_0 == "Three or four times a week" ~ 4,
alcohol_intake_frequency_f1558_0_0 == "Once or twice a week" ~ 2,
TRUE ~ NA_real_),
SUGDRINK = NA,
CURSMOKE = current_tobacco_smoking_f1239_0_0 == "Yes, on most or all days",
FAMGOUT = NA,
FAMGOUTNUM = NA,
Geno.SampleID = NA,
Pheno.Study = "UK Biobank") %>%
left_join(tmp) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Cleaing up.
rm(final_control_data, final_data, final_gout_data, refresh_ukbb_data, self_report_med, test, tmp, ukbb_dir, logicfactor, logicfactor2)
Data Exploration
The purpose of this section is to explore the phenotype/PRS data that was generated in the previous section.
After loading in the data, I first produce a table of missing data for all of my variables of interest. I use this as evidence for filtering individuals. After filtering the data, and removing certain phenotypes that had too much missing data, I then re-observe the missingness and report the proportion for each variable.
Next, I produce a table that shows the distribution of all variables of interest in all cohorts. I then plot each of these distributions in all cohorts. Next, I plot the relationship between several combinations of variables. Finally, I plot the minor allele frequency of all SNPs in each cohort.
# Loading phenotype file.
load(path(scratch_path, "Output/all_pheno_prs.RData"))
# Loading gout and urate PRS files.
load(here("Output/UKBB_Gene_OR.RData"))
load(here("Output/Tin_Gene_OR.RData"))
# Making a categorical flare frequency variable (FLARE_CAT) and setting all control gout severity traits to NA.
all_pheno_prs <- all_pheno_prs %>%
mutate(FLARE_CAT = factor(case_when(between(NUMATK, 0, 5) ~ paste0(as.character(NUMATK), " flares in last year"),
between(NUMATK, 6, 11) ~ "One every one to two months",
between(NUMATK, 12, 52) ~ "One or more per month"),
levels = c(paste0(0:5, " flares in last year"),
"One every one to two months",
"One or more per month"),
labels = c(paste0(0:5),
"6 - 11",
"12 - 52"),
ordered = TRUE),
AGE1ATK = case_when(GOUT ~ AGE1ATK),
DURATION = case_when(GOUT ~ DURATION),
NUMATK = case_when(GOUT ~ NUMATK),
TOPHIGOUT = case_when(GOUT ~ TOPHIGOUT),
ULT = case_when(GOUT ~ ULT))
# Making variable for naming all cohort subsets.
cohortstring <- c("UK Biobank - Gout - Male",
"UK Biobank - Gout - Female",
"UK Biobank - Control - Male",
"UK Biobank - Control - Female",
"Aus/NZ European - Gout - Male",
"Aus/NZ European - Gout - Female",
"Aus/NZ European - Control - Male",
"Aus/NZ European - Control - Female",
"GlobalGout - Gout - Male",
"GlobalGout - Gout - Female",
"GlobalGout - Control - Male",
"GlobalGout - Control - Female",
"Ardea - LASSO - Male",
"Ardea - LASSO - Female",
"Ardea - CLEAR 1 - Male",
"Ardea - CLEAR 1 - Female",
"Ardea - CLEAR 2 - Male",
"Ardea - CLEAR 2 - Female",
"Ardea - CRYSTAL - Male",
"Ardea - CRYSTAL - Female",
"Ardea - LIGHT - Male",
"Ardea - LIGHT - Female",
"East Polynesian - Gout - Male",
"East Polynesian - Gout - Female",
"East Polynesian - Control - Male",
"East Polynesian - Control - Female",
"East Polynesian - Gout - Male - NP",
"East Polynesian - Gout - Female - NP",
"East Polynesian - Control - Male - NP",
"East Polynesian - Control - Female - NP",
"West Polynesian - Gout - Male",
"West Polynesian - Gout - Female",
"West Polynesian - Control - Male",
"West Polynesian - Control - Female")
# Making list with each of the cohort subsets as listed in the cohortstring object. This will make it easier to manipulate the data for tables etc.
data_list <- list(all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Pheno.Study == "UK Biobank"),
all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Pheno.Study == "UK Biobank"),
all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Pheno.Study == "UK Biobank"),
all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Pheno.Study == "UK Biobank"),
all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian"),
Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease")),
all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian"),
Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease")),
all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian"),
Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease")),
all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian"),
Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease")),
all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Pheno.Study == "EuroGout",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Pheno.Study == "EuroGout",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Pheno.Study == "EuroGout",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Pheno.Study == "EuroGout",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs %>% filter(SEX == "Male",
Pheno.Study == "Ardea: 401",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs %>% filter(SEX == "Female",
Pheno.Study == "Ardea: 401",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs %>% filter(SEX == "Male",
Pheno.Study == "Ardea: CLEAR1",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs %>% filter(SEX == "Female",
Pheno.Study == "Ardea: CLEAR1",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs %>% filter(SEX == "Male",
Pheno.Study == "Ardea: CLEAR2",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs %>% filter(SEX == "Female",
Pheno.Study == "Ardea: CLEAR2",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs %>% filter(SEX == "Male",
Pheno.Study == "Ardea: CRYSTAL",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs %>% filter(SEX == "Female",
Pheno.Study == "Ardea: CRYSTAL",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs %>% filter(SEX == "Male",
Pheno.Study == "Ardea: LIGHT",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs %>% filter(SEX == "Female",
Pheno.Study == "Ardea: LIGHT",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study != "Ngati Porou"),
all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study != "Ngati Porou"),
all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study != "Ngati Porou"),
all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study != "Ngati Porou"),
all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study == "Ngati Porou"),
all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study == "Ngati Porou"),
all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study == "Ngati Porou"),
all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study == "Ngati Porou"),
all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan")),
all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan")),
all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan")),
all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan")))
# Making string of cohort subset names with HTML code to make for pretty printing on figures.
clean_names <- c("UK Biobank<br>Gout<br>Male",
"UK Biobank<br>Gout<br>Female",
"UK Biobank<br>Control<br>Male",
"UK Biobank<br>Control<br>Female",
"Aus/NZ European<br>Gout<br>Male",
"Aus/NZ European<br>Gout<br>Female",
"Aus/NZ European<br>Control<br>Male",
"Aus/NZ European<br>Control<br>Female",
"GlobalGout<br>Gout<br>Male",
"GlobalGout<br>Gout<br>Female",
"GlobalGout<br>Control<br>Male",
"GlobalGout<br>Control<br>Female",
"Ardea<br>LASSO<br>Gout<br>Male",
"Ardea<br>LASSO<br>Gout<br>Female",
"Ardea<br>CLEAR1<br>Gout<br>Male",
"Ardea<br>CLEAR1<br>Gout<br>Female",
"Ardea<br>CLEAR2<br>Gout<br>Male",
"Ardea<br>CLEAR2<br>Gout<br>Female",
"Ardea<br>CRYSTAL<br>Gout<br>Male",
"Ardea<br>CRYSTAL<br>Gout<br>Female",
"Ardea<br>LIGHT<br>Gout<br>Male",
"Ardea<br>LIGHT<br>Gout<br>Female",
"East Polynesian<br>Gout<br>Male",
"East Polynesian<br>Gout<br>Female",
"East Polynesian<br>Control<br>Male",
"East Polynesian<br>Control<br>Female",
"East Polynesian<br>Gout<br>Male<br>NP",
"East Polynesian<br>Gout<br>Female<br>NP",
"East Polynesian<br>Control<br>Male<br>NP",
"East Polynesian<br>Control<br>Female<br>NP",
"West Polynesian<br>Gout<br>Male",
"West Polynesian<br>Gout<br>Female",
"West Polynesian<br>Control<br>Male",
"West Polynesian<br>Control<br>Female")
# Defining functions for use throughout this section.
# This function will take any numeric variable and produce a report of the form "<mean> ± <sd>".
report <- function(x) {
if(sum(is.na(x)) != length(x)) {
paste0(sprintf(mean(x, na.rm = TRUE), fmt = "%#.1f"), " ± ", sprintf(sd(x, na.rm = TRUE), fmt = "%#.1f"))
} else {
paste0("NA")
}
}
# This function will take any numeric variable and produce a report of the form "<median> (<lower quartile> - <upper quartile>)".
report_median <- function(x) {
if(sum(is.na(x)) != length(x)) {
paste0(median(x, na.rm =T), " (", summary(x)[[2]], " - ", summary(x)[[5]], ")")
} else {
paste0("NA")
}
}
# This function will take any TRUE/FALSE variable and produce a report of the form "<N TRUE> (<% TRUE>)".
sumreport <- function(x) {
if(sum(is.na(x)) != length(x)){
paste0(sum(x, na.rm = TRUE), " (", sprintf((mean(x, na.rm = TRUE) * 100), fmt = "%#.1f"), ")")
} else {
paste0("NA")
}
}
# This function transposes a dataframe.
transpose_df <- function(df) {
t_df <- data.table::transpose(df)
colnames(t_df) <- rownames(df)
rownames(t_df) <- colnames(df)
t_df <- t_df %>%
rownames_to_column() %>%
as_tibble() %>%
row_to_names(row_number = 1)
return(t_df)
}
# This function will take any variable and report the missingness as either "All", "None", or the form "<N missing> (<% missing>)".
missing <- function(x){
if(sum(is.na(x)) == length(x)) {
return("All")
} else if(sum(!is.na(x)) == length(x)){
return("None")
} else {
paste0(format(sum(is.na(x)), big.mark = ","), " (", format(round((sum(is.na(x)) / length(x) * 100), digits = 1), nsmall = 1), ")")
}
}
Table of Missing Data Percentages
# Making table with missing data statistics for each cohort.
table1 <- tibble("Cohort" = cohortstring,
"N" = unlist(lapply(data_list, function(x) format(nrow(x), big.mark = ","))),
"Age at Recruitment" = unlist(lapply(data_list, function(x) missing(x$AGECOL))),
"Serum Urate" = unlist(lapply(data_list, function(x) missing(x$URATE))),
"ULT" = unlist(lapply(data_list, function(x) missing(x$ULT))),
"Age at Onset" = unlist(lapply(data_list, function(x) missing(x$AGE1ATK))),
"Disease Duration" = unlist(lapply(data_list, function(x) missing(x$DURATION))),
"Flare Frequency" = unlist(lapply(data_list, function(x) missing(x$NUMATK))),
"Tophaceous Disease" = unlist(lapply(data_list, function(x) missing(x$TOPHIGOUT))),
"PRS" = unlist(lapply(data_list, function(x) missing(x$PRS))),
"Urate PRS" = unlist(lapply(data_list, function(x) missing(x$Urate_PRS))),
"Prophylaxis" = unlist(lapply(data_list, function(x) missing(x$PROPHY))),
"BMI" = unlist(lapply(data_list, function(x) missing(x$BMI))),
"Hypertension" = unlist(lapply(data_list, function(x) missing(x$HYPERTENSION))),
"Type 2 Diabetes" = unlist(lapply(data_list, function(x) missing(x$DIABETES))),
"Heart Disease" = unlist(lapply(data_list, function(x) missing(x$HEART))),
"Kidney Disease" = unlist(lapply(data_list, function(x) missing(x$KIDNEY))),
"Dyslipidemia" = unlist(lapply(data_list, function(x) missing(x$LIPIDS))),
"Stroke" = unlist(lapply(data_list, function(x) missing(x$STROKE))),
"Alcoholic Drinks / Week" = unlist(lapply(data_list, function(x) missing(x$TOTALALC))),
"Sugar-Sweetened Drinks / Week" = unlist(lapply(data_list, function(x) missing(x$SUGDRINK))),
"Current Smoker" = unlist(lapply(data_list, function(x) missing(x$CURSMOKE))),
"Family History of Gout" = unlist(lapply(data_list, function(x) missing(x$FAMGOUT))),
"No. Relatives w/ Gout" = unlist(lapply(data_list, function(x) missing(x$FAMGOUTNUM))))
# Transposing table and preparing it for display as an html.
table1 <- transpose_df(table1) %>%
column_to_rownames(var = "Cohort") %>%
mutate(across(.cols = 1:ncol(table1), ~ str_replace_all(string = .x, pattern = " ", replacement = " ")))
# Further preparation of table.
row.names(table1) <- str_replace_all(row.names(table1), " ", " ")
# Printing table using the kable function.
table1 %>%
kable(col.names = clean_names,
align = "c",
escape = F) %>%
kable_styling("striped") %>%
scroll_box(width = "800px", height = "490px") %>%
footnote("'All' = all missing, 'None' = none missing")
|
|
UK Biobank Gout Male
|
UK Biobank Gout Female
|
UK Biobank Control Male
|
UK Biobank Control Female
|
Aus/NZ European Gout Male
|
Aus/NZ European Gout Female
|
Aus/NZ European Control Male
|
Aus/NZ European Control Female
|
GlobalGout Gout Male
|
GlobalGout Gout Female
|
GlobalGout Control Male
|
GlobalGout Control Female
|
Ardea LASSO Gout Male
|
Ardea LASSO Gout Female
|
Ardea CLEAR1 Gout Male
|
Ardea CLEAR1 Gout Female
|
Ardea CLEAR2 Gout Male
|
Ardea CLEAR2 Gout Female
|
Ardea CRYSTAL Gout Male
|
Ardea CRYSTAL Gout Female
|
Ardea LIGHT Gout Male
|
Ardea LIGHT Gout Female
|
East Polynesian Gout Male
|
East Polynesian Gout Female
|
East Polynesian Control Male
|
East Polynesian Control Female
|
East Polynesian Gout Male NP
|
East Polynesian Gout Female NP
|
East Polynesian Control Male NP
|
East Polynesian Control Female NP
|
West Polynesian Gout Male
|
West Polynesian Gout Female
|
West Polynesian Control Male
|
West Polynesian Control Female
|
|
N
|
6,584
|
547
|
152,769
|
172,450
|
1,040
|
227
|
769
|
623
|
1,627
|
221
|
45
|
84
|
826
|
66
|
231
|
16
|
241
|
7
|
178
|
4
|
110
|
8
|
421
|
133
|
243
|
385
|
128
|
29
|
44
|
37
|
457
|
62
|
203
|
184
|
|
Age at Recruitment
|
None
|
None
|
None
|
None
|
1 (0.1)
|
None
|
None
|
None
|
149 (9.2)
|
18 (8.1)
|
1 (2.2)
|
5 (6.0)
|
5 (0.6)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
1 (0.5)
|
|
Serum Urate
|
307 (4.7)
|
29 (5.3)
|
7,152 (4.7)
|
8,379 (4.9)
|
34 (3.3)
|
8 (3.5)
|
36 (4.7)
|
25 (4.0)
|
309 (19.0)
|
53 (24.0)
|
33 (73.3)
|
76 (90.5)
|
4 (0.5)
|
1 (1.5)
|
1 (0.4)
|
None
|
1 (0.4)
|
None
|
None
|
None
|
None
|
None
|
4 (1.0)
|
1 (0.8)
|
35 (14.4)
|
65 (16.9)
|
1 (0.8)
|
1 (3.4)
|
None
|
None
|
2 (0.4)
|
2 (3.2)
|
22 (10.8)
|
17 (9.2)
|
|
ULT
|
None
|
None
|
All
|
All
|
435 (41.8)
|
111 (48.9)
|
All
|
All
|
706 (43.4)
|
108 (48.9)
|
All
|
All
|
5 (0.6)
|
1 (1.5)
|
None
|
None
|
None
|
None
|
81 (45.5)
|
3 (75.0)
|
None
|
None
|
135 (32.1)
|
38 (28.6)
|
All
|
All
|
23 (18.0)
|
2 (6.9)
|
All
|
All
|
134 (29.3)
|
19 (30.6)
|
All
|
All
|
|
Age at Onset
|
All
|
All
|
All
|
All
|
105 (10.1)
|
41 (18.1)
|
All
|
All
|
607 (37.3)
|
99 (44.8)
|
All
|
All
|
5 (0.6)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
40 (9.5)
|
18 (13.5)
|
All
|
All
|
3 (2.3)
|
1 (3.4)
|
All
|
All
|
33 (7.2)
|
16 (25.8)
|
All
|
All
|
|
Disease Duration
|
All
|
All
|
All
|
All
|
105 (10.1)
|
41 (18.1)
|
All
|
All
|
608 (37.4)
|
99 (44.8)
|
All
|
All
|
5 (0.6)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
40 (9.5)
|
18 (13.5)
|
All
|
All
|
3 (2.3)
|
1 (3.4)
|
All
|
All
|
33 (7.2)
|
16 (25.8)
|
All
|
All
|
|
Flare Frequency
|
All
|
All
|
All
|
All
|
173 (16.6)
|
51 (22.5)
|
All
|
All
|
654 (40.2)
|
99 (44.8)
|
All
|
All
|
5 (0.6)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
45 (10.7)
|
23 (17.3)
|
All
|
All
|
5 (3.9)
|
3 (10.3)
|
All
|
All
|
36 (7.9)
|
16 (25.8)
|
All
|
All
|
|
Tophaceous Disease
|
All
|
All
|
All
|
All
|
263 (25.3)
|
58 (25.6)
|
All
|
All
|
1,063 (65.3)
|
146 (66.1)
|
All
|
All
|
5 (0.6)
|
None
|
2 (0.9)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
72 (17.1)
|
39 (29.3)
|
All
|
All
|
52 (40.6)
|
8 (27.6)
|
All
|
All
|
55 (12.0)
|
12 (19.4)
|
All
|
All
|
|
PRS
|
374 (5.7)
|
25 (4.6)
|
9,195 (6.0)
|
10,206 (5.9)
|
9 (0.9)
|
2 (0.9)
|
5 (0.7)
|
4 (0.6)
|
127 (7.8)
|
19 (8.6)
|
None
|
None
|
2 (0.2)
|
1 (1.5)
|
1 (0.4)
|
None
|
2 (0.8)
|
None
|
3 (1.7)
|
None
|
1 (0.9)
|
None
|
None
|
2 (1.5)
|
2 (0.8)
|
3 (0.8)
|
2 (1.6)
|
None
|
None
|
None
|
6 (1.3)
|
1 (1.6)
|
4 (2.0)
|
3 (1.6)
|
|
Urate PRS
|
2,226 (33.8)
|
172 (31.4)
|
51,780 (33.9)
|
58,626 (34.0)
|
55 (5.3)
|
9 (4.0)
|
23 (3.0)
|
93 (14.9)
|
187 (11.5)
|
27 (12.2)
|
1 (2.2)
|
None
|
30 (3.6)
|
3 (4.5)
|
5 (2.2)
|
None
|
15 (6.2)
|
None
|
6 (3.4)
|
None
|
4 (3.6)
|
None
|
14 (3.3)
|
6 (4.5)
|
10 (4.1)
|
26 (6.8)
|
10 (7.8)
|
2 (6.9)
|
None
|
3 (8.1)
|
21 (4.6)
|
3 (4.8)
|
11 (5.4)
|
12 (6.5)
|
|
Prophylaxis
|
All
|
All
|
All
|
All
|
971 (93.4)
|
208 (91.6)
|
768 (99.9)
|
All
|
783 (48.1)
|
125 (56.6)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
361 (85.7)
|
118 (88.7)
|
242 (99.6)
|
All
|
2 (1.6)
|
3 (10.3)
|
43 (97.7)
|
All
|
405 (88.6)
|
53 (85.5)
|
All
|
183 (99.5)
|
|
BMI
|
23 (0.3)
|
3 (0.5)
|
485 (0.3)
|
493 (0.3)
|
79 (7.6)
|
22 (9.7)
|
175 (22.8)
|
53 (8.5)
|
502 (30.9)
|
85 (38.5)
|
All
|
All
|
10 (1.2)
|
None
|
1 (0.4)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
9 (2.1)
|
5 (3.8)
|
3 (1.2)
|
13 (3.4)
|
3 (2.3)
|
None
|
4 (9.1)
|
6 (16.2)
|
13 (2.8)
|
5 (8.1)
|
3 (1.5)
|
9 (4.9)
|
|
Hypertension
|
None
|
None
|
None
|
None
|
408 (39.2)
|
39 (17.2)
|
489 (63.6)
|
324 (52.0)
|
855 (52.6)
|
120 (54.3)
|
All
|
All
|
None
|
None
|
1 (0.4)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
128 (30.4)
|
18 (13.5)
|
172 (70.8)
|
253 (65.7)
|
32 (25.0)
|
3 (10.3)
|
32 (72.7)
|
28 (75.7)
|
205 (44.9)
|
15 (24.2)
|
161 (79.3)
|
139 (75.5)
|
|
Type 2 Diabetes
|
678 (10.3)
|
55 (10.1)
|
16,611 (10.9)
|
22,114 (12.8)
|
106 (10.2)
|
25 (11.0)
|
314 (40.8)
|
268 (43.0)
|
646 (39.7)
|
113 (51.1)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
12 (2.9)
|
1 (0.8)
|
9 (3.7)
|
18 (4.7)
|
90 (70.3)
|
19 (65.5)
|
37 (84.1)
|
36 (97.3)
|
18 (3.9)
|
2 (3.2)
|
6 (3.0)
|
9 (4.9)
|
|
Heart Disease
|
None
|
None
|
None
|
None
|
235 (22.6)
|
35 (15.4)
|
381 (49.5)
|
294 (47.2)
|
1,093 (67.2)
|
142 (64.3)
|
All
|
All
|
None
|
None
|
1 (0.4)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
46 (10.9)
|
11 (8.3)
|
28 (11.5)
|
61 (15.8)
|
48 (37.5)
|
7 (24.1)
|
29 (65.9)
|
24 (64.9)
|
39 (8.5)
|
5 (8.1)
|
48 (23.6)
|
14 (7.6)
|
|
Kidney Disease
|
295 (4.5)
|
25 (4.6)
|
6,986 (4.6)
|
8,151 (4.7)
|
245 (23.6)
|
47 (20.7)
|
385 (50.1)
|
263 (42.2)
|
1,083 (66.6)
|
134 (60.6)
|
All
|
All
|
10 (1.2)
|
None
|
2 (0.9)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
90 (21.4)
|
27 (20.3)
|
193 (79.4)
|
311 (80.8)
|
98 (76.6)
|
17 (58.6)
|
38 (86.4)
|
29 (78.4)
|
82 (17.9)
|
11 (17.7)
|
163 (80.3)
|
145 (78.8)
|
|
Dyslipidemia
|
None
|
None
|
None
|
None
|
404 (38.8)
|
73 (32.2)
|
439 (57.1)
|
215 (34.5)
|
719 (44.2)
|
121 (54.8)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
1 (0.6)
|
None
|
None
|
None
|
147 (34.9)
|
44 (33.1)
|
172 (70.8)
|
272 (70.6)
|
56 (43.8)
|
7 (24.1)
|
28 (63.6)
|
29 (78.4)
|
159 (34.8)
|
16 (25.8)
|
148 (72.9)
|
117 (63.6)
|
|
Stroke
|
None
|
None
|
None
|
None
|
320 (30.8)
|
65 (28.6)
|
268 (34.9)
|
90 (14.4)
|
1,089 (66.9)
|
166 (75.1)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
65 (15.4)
|
18 (13.5)
|
32 (13.2)
|
50 (13.0)
|
68 (53.1)
|
11 (37.9)
|
31 (70.5)
|
26 (70.3)
|
53 (11.6)
|
7 (11.3)
|
46 (22.7)
|
15 (8.2)
|
|
Alcoholic Drinks / Week
|
840 (12.8)
|
251 (45.9)
|
31,086 (20.3)
|
59,837 (34.7)
|
7 (0.7)
|
1 (0.4)
|
124 (16.1)
|
204 (32.7)
|
1,079 (66.3)
|
150 (67.9)
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
34 (8.1)
|
3 (2.3)
|
2 (0.8)
|
7 (1.8)
|
None
|
None
|
None
|
None
|
36 (7.9)
|
2 (3.2)
|
3 (1.5)
|
1 (0.5)
|
|
Sugar-Sweetened Drinks / Week
|
All
|
All
|
All
|
All
|
126 (12.1)
|
31 (13.7)
|
173 (22.5)
|
51 (8.2)
|
1,352 (83.1)
|
190 (86.0)
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
76 (18.1)
|
17 (12.8)
|
6 (2.5)
|
10 (2.6)
|
2 (1.6)
|
None
|
None
|
None
|
91 (19.9)
|
5 (8.1)
|
4 (2.0)
|
2 (1.1)
|
|
Current Smoker
|
None
|
None
|
None
|
None
|
486 (46.7)
|
128 (56.4)
|
263 (34.2)
|
143 (23.0)
|
1,030 (63.3)
|
135 (61.1)
|
All
|
All
|
All
|
All
|
2 (0.9)
|
None
|
None
|
None
|
None
|
None
|
3 (2.7)
|
None
|
216 (51.3)
|
72 (54.1)
|
100 (41.2)
|
191 (49.6)
|
59 (46.1)
|
9 (31.0)
|
29 (65.9)
|
25 (67.6)
|
319 (69.8)
|
22 (35.5)
|
95 (46.8)
|
71 (38.6)
|
|
Family History of Gout
|
All
|
All
|
All
|
All
|
99 (9.5)
|
26 (11.5)
|
410 (53.3)
|
300 (48.2)
|
703 (43.2)
|
101 (45.7)
|
36 (80.0)
|
76 (90.5)
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
61 (14.5)
|
16 (12.0)
|
42 (17.3)
|
59 (15.3)
|
10 (7.8)
|
2 (6.9)
|
3 (6.8)
|
4 (10.8)
|
58 (12.7)
|
3 (4.8)
|
26 (12.8)
|
24 (13.0)
|
|
No. Relatives w/ Gout
|
All
|
All
|
All
|
All
|
327 (31.4)
|
81 (35.7)
|
535 (69.6)
|
462 (74.2)
|
1,229 (75.5)
|
179 (81.0)
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
111 (26.4)
|
33 (24.8)
|
66 (27.2)
|
89 (23.1)
|
38 (29.7)
|
9 (31.0)
|
17 (38.6)
|
10 (27.0)
|
94 (20.6)
|
18 (29.0)
|
45 (22.2)
|
49 (26.6)
|
|
Note:
|
|
‘All’ = all missing, ‘None’ = none missing
|
From this table we can tell the following:
Most cohorts have at minimum 50 individuals within them, though all Ardea Ironwood studies and the East Polynesian Ngati Porou gout cohort have less than 30 females, suggesting these cohorts will have limited power for study of gout phenotypes.
Very few cohorts are missing any age at recruitment measurements, though the GlobalGout cohorts has close to 10% missingness for this phenotype. For further analyses, anybody missing this phenotype should be excluded.
Among gout cohorts, serum urate is missing in less than 6% of individuals, excluding the GlobalGout cohort, with close to 20% missing data. Control cohorts have slightly higher missing data rates for this variable. This variable should therefore be suitable for use in analyses.
Among gout cohorts, data for current ULT usage is missing in close to 50% of individuals among all cohorts excluding the Ardea cohorts (though it is missing at 50% or more in CRYSTAL cohorts).
Among gout cohorts (excluding UK Biobank), data for age at gout onset and disease duration are missing in around 10% or fewer individuals for all cohorts except GlobalGout, which has over 40% missingness for these variables.
Among gout cohorts (excluding UK Biobank), data for flare frequency is missing in up to ~20% of individuals for most cohorts, but around 43% of the GlobalGout cohort are missing this variable.
Among gout cohorts (excluding UK Biobank), data for tophaceous disease is missing at very low rates among Ardea cohorts, but is missing in around 25% of individuals among Aus/NZ European cohorts and Polynesian cohorts. Importantly, it is missing in close to two thirds of individuals in the GlobalGout cohort.
The gout PRS has very little missing data in all cohorts, though there is up to 10% missing data for individuals in the GlobalGout and UK Biobank cohorts.
The urate PRS is generally missing at higher rates than the gout PRS, and thus comparisons should be carefully done to ensure the cohort size is consistent.
Data on prophylaxis, comorbidities, lifestyle factors, and family history of gout are all inconsistently phenotyped across the cohorts and thus should not be used in analyses (with the exception of BMI which is missing at low rates in all cohorts except for GlobalGout).
Based on the above observations, I will take the following actions:
I will remove anyone missing age at recruitment.
I will remove anyone missing all three severity traits (excluding controls and UK Biobank).
I will remove anyone missing the gout PRS (and the urate PRS for models comparing the two risk scores).
# Excluding individuals missing age at recruitment, and those missing the gout PRS, and those gout cases missing all three severity traits (excluding the UK Biobank cohort).
all_pheno_prs2 <- all_pheno_prs %>%
filter(!is.na(AGECOL),
!is.na(PRS),
(GOUT & !(is.na(AGE1ATK) & is.na(NUMATK) & is.na(TOPHIGOUT)) | Pheno.Study == "UK Biobank" | !GOUT))
# Remaking the data_list object with the newly filtered data.
data_list2 <- list(all_pheno_prs2 %>% filter(SEX == "Male",
GOUT,
Pheno.Study == "UK Biobank"),
all_pheno_prs2 %>% filter(SEX == "Female",
GOUT,
Pheno.Study == "UK Biobank"),
all_pheno_prs2 %>% filter(SEX == "Male",
!GOUT,
Pheno.Study == "UK Biobank"),
all_pheno_prs2 %>% filter(SEX == "Female",
!GOUT,
Pheno.Study == "UK Biobank"),
all_pheno_prs2 %>% filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian"),
Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease")),
all_pheno_prs2 %>% filter(SEX == "Female",
GOUT,
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian"),
Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease")),
all_pheno_prs2 %>% filter(SEX == "Male",
!GOUT,
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian"),
Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease")),
all_pheno_prs2 %>% filter(SEX == "Female",
!GOUT,
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian"),
Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease")),
all_pheno_prs2 %>% filter(SEX == "Male",
GOUT,
Pheno.Study == "EuroGout",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs2 %>% filter(SEX == "Female",
GOUT,
Pheno.Study == "EuroGout",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs2 %>% filter(SEX == "Male",
!GOUT,
Pheno.Study == "EuroGout",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs2 %>% filter(SEX == "Female",
!GOUT,
Pheno.Study == "EuroGout",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs2 %>% filter(SEX == "Male",
Pheno.Study == "Ardea: 401",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs2 %>% filter(SEX == "Female",
Pheno.Study == "Ardea: 401",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs2 %>% filter(SEX == "Male",
Pheno.Study == "Ardea: CLEAR1",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs2 %>% filter(SEX == "Female",
Pheno.Study == "Ardea: CLEAR1",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs2 %>% filter(SEX == "Male",
Pheno.Study == "Ardea: CLEAR2",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs2 %>% filter(SEX == "Female",
Pheno.Study == "Ardea: CLEAR2",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs2 %>% filter(SEX == "Male",
Pheno.Study == "Ardea: CRYSTAL",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs2 %>% filter(SEX == "Female",
Pheno.Study == "Ardea: CRYSTAL",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs2 %>% filter(SEX == "Male",
Pheno.Study == "Ardea: LIGHT",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs2 %>% filter(SEX == "Female",
Pheno.Study == "Ardea: LIGHT",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
all_pheno_prs2 %>% filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study != "Ngati Porou"),
all_pheno_prs2 %>% filter(SEX == "Female",
GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study != "Ngati Porou"),
all_pheno_prs2 %>% filter(SEX == "Male",
!GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study != "Ngati Porou"),
all_pheno_prs2 %>% filter(SEX == "Female",
!GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study != "Ngati Porou"),
all_pheno_prs2 %>% filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study == "Ngati Porou"),
all_pheno_prs2 %>% filter(SEX == "Female",
GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study == "Ngati Porou"),
all_pheno_prs2 %>% filter(SEX == "Male",
!GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study == "Ngati Porou"),
all_pheno_prs2 %>% filter(SEX == "Female",
!GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study == "Ngati Porou"),
all_pheno_prs2 %>% filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan")),
all_pheno_prs2 %>% filter(SEX == "Female",
GOUT,
Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan")),
all_pheno_prs2 %>% filter(SEX == "Male",
!GOUT,
Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan")),
all_pheno_prs2 %>% filter(SEX == "Female",
!GOUT,
Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan")))
# Remaking missing data table after filtering.
table1 <- tibble("Cohort" = cohortstring,
"N" = unlist(lapply(data_list2, function(x) format(nrow(x), big.mark = ","))),
"Age at Collection" = unlist(lapply(data_list2, function(x) missing(x$AGECOL))),
"Serum Urate" = unlist(lapply(data_list2, function(x) missing(x$URATE))),
"ULT" = unlist(lapply(data_list2, function(x) missing(x$ULT))),
"Age at Onset" = unlist(lapply(data_list2, function(x) missing(x$AGE1ATK))),
"Disease Duration" = unlist(lapply(data_list2, function(x) missing(x$DURATION))),
"Flares" = unlist(lapply(data_list2, function(x) missing(x$NUMATK))),
"Tophi" = unlist(lapply(data_list2, function(x) missing(x$TOPHIGOUT))),
"PRS" = unlist(lapply(data_list2, function(x) missing(x$PRS))),
"Urate PRS" = unlist(lapply(data_list2, function(x) missing(x$Urate_PRS))),
"Prophylaxis" = unlist(lapply(data_list2, function(x) missing(x$PROPHY))),
"BMI" = unlist(lapply(data_list2, function(x) missing(x$BMI))),
"Hypertension" = unlist(lapply(data_list2, function(x) missing(x$HYPERTENSION))),
"Type 2 Diabetes" = unlist(lapply(data_list2, function(x) missing(x$DIABETES))),
"Heart Disease" = unlist(lapply(data_list2, function(x) missing(x$HEART))),
"Kidney Disease" = unlist(lapply(data_list2, function(x) missing(x$KIDNEY))),
"Dyslipidemia" = unlist(lapply(data_list2, function(x) missing(x$LIPIDS))),
"Stroke" = unlist(lapply(data_list2, function(x) missing(x$STROKE))),
"Alcoholic Drinks / Week" = unlist(lapply(data_list2, function(x) missing(x$TOTALALC))),
"Sugar-Sweetened Drinks / Week" = unlist(lapply(data_list2, function(x) missing(x$SUGDRINK))),
"Current Smoker" = unlist(lapply(data_list2, function(x) missing(x$CURSMOKE))),
"Family History of Gout" = unlist(lapply(data_list2, function(x) missing(x$FAMGOUT))),
"No. Relatives w/ Gout" = unlist(lapply(data_list2, function(x) missing(x$FAMGOUTNUM))))
# Transposing table.
table1 <- transpose_df(table1) %>%
column_to_rownames(var = "Cohort") %>%
mutate(across(.cols = 1:ncol(table1), ~ str_replace(string = .x, pattern = " ", replacement = " ")))
# Further preparing for printing.
row.names(table1) <- str_replace(row.names(table1), " ", " ")
# Printing table.
table1 %>%
kable(col.names = clean_names,
align = "c",
escape = F) %>%
kable_styling("striped") %>%
scroll_box(width = "800px", height = "490px") %>%
footnote("'All' = all missing, 'None' = none missing")
|
|
UK Biobank Gout Male
|
UK Biobank Gout Female
|
UK Biobank Control Male
|
UK Biobank Control Female
|
Aus/NZ European Gout Male
|
Aus/NZ European Gout Female
|
Aus/NZ European Control Male
|
Aus/NZ European Control Female
|
GlobalGout Gout Male
|
GlobalGout Gout Female
|
GlobalGout Control Male
|
GlobalGout Control Female
|
Ardea LASSO Gout Male
|
Ardea LASSO Gout Female
|
Ardea CLEAR1 Gout Male
|
Ardea CLEAR1 Gout Female
|
Ardea CLEAR2 Gout Male
|
Ardea CLEAR2 Gout Female
|
Ardea CRYSTAL Gout Male
|
Ardea CRYSTAL Gout Female
|
Ardea LIGHT Gout Male
|
Ardea LIGHT Gout Female
|
East Polynesian Gout Male
|
East Polynesian Gout Female
|
East Polynesian Control Male
|
East Polynesian Control Female
|
East Polynesian Gout Male NP
|
East Polynesian Gout Female NP
|
East Polynesian Control Male NP
|
East Polynesian Control Female NP
|
West Polynesian Gout Male
|
West Polynesian Gout Female
|
West Polynesian Control Male
|
West Polynesian Control Female
|
|
N
|
6,210
|
522
|
143,574
|
162,244
|
978
|
210
|
764
|
619
|
1,032
|
124
|
44
|
79
|
819
|
65
|
230
|
16
|
239
|
7
|
175
|
4
|
109
|
8
|
408
|
122
|
241
|
382
|
124
|
28
|
44
|
37
|
436
|
54
|
199
|
180
|
|
Age at Collection
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
|
Serum Urate
|
287 (4.6)
|
29 (5.6)
|
6,684 (4.7)
|
7,869 (4.9)
|
18 (1.8)
|
4 (1.9)
|
36 (4.7)
|
24 (3.9)
|
50 (4.8)
|
8 (6.5)
|
32 (72.7)
|
71 (89.9)
|
4 (0.5)
|
1 (1.5)
|
1 (0.4)
|
None
|
1 (0.4)
|
None
|
None
|
None
|
None
|
None
|
2 (0.5)
|
1 (0.8)
|
35 (14.5)
|
64 (16.8)
|
None
|
1 (3.6)
|
None
|
None
|
1 (0.2)
|
1 (1.9)
|
21 (10.6)
|
16 (8.9)
|
|
ULT
|
None
|
None
|
All
|
All
|
394 (40.3)
|
97 (46.2)
|
All
|
All
|
279 (27.0)
|
42 (33.9)
|
All
|
All
|
3 (0.4)
|
1 (1.5)
|
None
|
None
|
None
|
None
|
81 (46.3)
|
3 (75.0)
|
None
|
None
|
124 (30.4)
|
32 (26.2)
|
All
|
All
|
21 (16.9)
|
2 (7.1)
|
All
|
All
|
126 (28.9)
|
14 (25.9)
|
All
|
All
|
|
Age at Onset
|
All
|
All
|
All
|
All
|
51 (5.2)
|
24 (11.4)
|
All
|
All
|
25 (2.4)
|
6 (4.8)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
27 (6.6)
|
9 (7.4)
|
All
|
All
|
1 (0.8)
|
None
|
All
|
All
|
18 (4.1)
|
9 (16.7)
|
All
|
All
|
|
Disease Duration
|
All
|
All
|
All
|
All
|
51 (5.2)
|
24 (11.4)
|
All
|
All
|
25 (2.4)
|
6 (4.8)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
27 (6.6)
|
9 (7.4)
|
All
|
All
|
1 (0.8)
|
None
|
All
|
All
|
18 (4.1)
|
9 (16.7)
|
All
|
All
|
|
Flares
|
All
|
All
|
All
|
All
|
119 (12.2)
|
34 (16.2)
|
All
|
All
|
72 (7.0)
|
6 (4.8)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
32 (7.8)
|
14 (11.5)
|
All
|
All
|
3 (2.4)
|
2 (7.1)
|
All
|
All
|
20 (4.6)
|
9 (16.7)
|
All
|
All
|
|
Tophi
|
All
|
All
|
All
|
All
|
210 (21.5)
|
42 (20.0)
|
All
|
All
|
476 (46.1)
|
50 (40.3)
|
All
|
All
|
None
|
None
|
2 (0.9)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
59 (14.5)
|
30 (24.6)
|
All
|
All
|
50 (40.3)
|
7 (25.0)
|
All
|
All
|
39 (8.9)
|
5 (9.3)
|
All
|
All
|
|
PRS
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
|
Urate PRS
|
2,049 (33.0)
|
166 (31.8)
|
47,833 (33.3)
|
54,154 (33.4)
|
49 (5.0)
|
7 (3.3)
|
20 (2.6)
|
90 (14.5)
|
56 (5.4)
|
7 (5.6)
|
1 (2.3)
|
None
|
28 (3.4)
|
3 (4.6)
|
5 (2.2)
|
None
|
14 (5.9)
|
None
|
4 (2.3)
|
None
|
3 (2.8)
|
None
|
14 (3.4)
|
4 (3.3)
|
8 (3.3)
|
23 (6.0)
|
8 (6.5)
|
2 (7.1)
|
None
|
3 (8.1)
|
15 (3.4)
|
1 (1.9)
|
7 (3.5)
|
10 (5.6)
|
|
Prophylaxis
|
All
|
All
|
All
|
All
|
918 (93.9)
|
196 (93.3)
|
763 (99.9)
|
All
|
317 (30.7)
|
43 (34.7)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
348 (85.3)
|
108 (88.5)
|
240 (99.6)
|
All
|
None
|
2 (7.1)
|
43 (97.7)
|
All
|
390 (89.4)
|
45 (83.3)
|
All
|
179 (99.4)
|
|
BMI
|
22 (0.4)
|
3 (0.6)
|
454 (0.3)
|
463 (0.3)
|
58 (5.9)
|
18 (8.6)
|
174 (22.8)
|
53 (8.6)
|
44 (4.3)
|
2 (1.6)
|
All
|
All
|
5 (0.6)
|
None
|
1 (0.4)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
7 (1.7)
|
4 (3.3)
|
3 (1.2)
|
13 (3.4)
|
2 (1.6)
|
None
|
4 (9.1)
|
6 (16.2)
|
10 (2.3)
|
3 (5.6)
|
3 (1.5)
|
8 (4.4)
|
|
Hypertension
|
None
|
None
|
None
|
None
|
383 (39.2)
|
36 (17.1)
|
486 (63.6)
|
322 (52.0)
|
365 (35.4)
|
35 (28.2)
|
All
|
All
|
None
|
None
|
1 (0.4)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
120 (29.4)
|
14 (11.5)
|
171 (71.0)
|
252 (66.0)
|
30 (24.2)
|
3 (10.7)
|
32 (72.7)
|
28 (75.7)
|
196 (45.0)
|
13 (24.1)
|
158 (79.4)
|
135 (75.0)
|
|
Type 2 Diabetes
|
636 (10.2)
|
53 (10.2)
|
15,566 (10.8)
|
20,793 (12.8)
|
72 (7.4)
|
17 (8.1)
|
312 (40.8)
|
266 (43.0)
|
181 (17.5)
|
29 (23.4)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
11 (2.7)
|
1 (0.8)
|
9 (3.7)
|
18 (4.7)
|
86 (69.4)
|
18 (64.3)
|
37 (84.1)
|
36 (97.3)
|
15 (3.4)
|
None
|
6 (3.0)
|
8 (4.4)
|
|
Heart Disease
|
None
|
None
|
None
|
None
|
209 (21.4)
|
32 (15.2)
|
379 (49.6)
|
292 (47.2)
|
518 (50.2)
|
52 (41.9)
|
All
|
All
|
None
|
None
|
1 (0.4)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
45 (11.0)
|
11 (9.0)
|
28 (11.6)
|
60 (15.7)
|
46 (37.1)
|
7 (25.0)
|
29 (65.9)
|
24 (64.9)
|
34 (7.8)
|
4 (7.4)
|
46 (23.1)
|
13 (7.2)
|
|
Kidney Disease
|
277 (4.5)
|
25 (4.8)
|
6,530 (4.5)
|
7,657 (4.7)
|
222 (22.7)
|
43 (20.5)
|
383 (50.1)
|
261 (42.2)
|
512 (49.6)
|
44 (35.5)
|
All
|
All
|
5 (0.6)
|
None
|
2 (0.9)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
79 (19.4)
|
25 (20.5)
|
191 (79.3)
|
308 (80.6)
|
94 (75.8)
|
16 (57.1)
|
38 (86.4)
|
29 (78.4)
|
73 (16.7)
|
9 (16.7)
|
160 (80.4)
|
141 (78.3)
|
|
Dyslipidemia
|
None
|
None
|
None
|
None
|
376 (38.4)
|
71 (33.8)
|
438 (57.3)
|
214 (34.6)
|
255 (24.7)
|
35 (28.2)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
1 (0.6)
|
None
|
None
|
None
|
139 (34.1)
|
39 (32.0)
|
171 (71.0)
|
269 (70.4)
|
53 (42.7)
|
7 (25.0)
|
28 (63.6)
|
29 (78.4)
|
150 (34.4)
|
14 (25.9)
|
147 (73.9)
|
115 (63.9)
|
|
Stroke
|
None
|
None
|
None
|
None
|
298 (30.5)
|
62 (29.5)
|
267 (34.9)
|
90 (14.5)
|
603 (58.4)
|
73 (58.9)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
63 (15.4)
|
16 (13.1)
|
32 (13.3)
|
49 (12.8)
|
66 (53.2)
|
11 (39.3)
|
31 (70.5)
|
26 (70.3)
|
48 (11.0)
|
5 (9.3)
|
44 (22.1)
|
14 (7.8)
|
|
Alcoholic Drinks / Week
|
793 (12.8)
|
240 (46.0)
|
29,263 (20.4)
|
56,288 (34.7)
|
None
|
None
|
123 (16.1)
|
202 (32.6)
|
509 (49.3)
|
57 (46.0)
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
34 (8.3)
|
3 (2.5)
|
2 (0.8)
|
7 (1.8)
|
None
|
None
|
None
|
None
|
33 (7.6)
|
2 (3.7)
|
3 (1.5)
|
1 (0.6)
|
|
Sugar-Sweetened Drinks / Week
|
All
|
All
|
All
|
All
|
106 (10.8)
|
27 (12.9)
|
172 (22.5)
|
51 (8.2)
|
762 (73.8)
|
94 (75.8)
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
75 (18.4)
|
17 (13.9)
|
6 (2.5)
|
10 (2.6)
|
None
|
None
|
None
|
None
|
87 (20.0)
|
4 (7.4)
|
4 (2.0)
|
1 (0.6)
|
|
Current Smoker
|
None
|
None
|
None
|
None
|
463 (47.3)
|
124 (59.0)
|
262 (34.3)
|
142 (22.9)
|
464 (45.0)
|
47 (37.9)
|
All
|
All
|
All
|
All
|
2 (0.9)
|
None
|
None
|
None
|
None
|
None
|
3 (2.8)
|
None
|
211 (51.7)
|
69 (56.6)
|
100 (41.5)
|
190 (49.7)
|
57 (46.0)
|
9 (32.1)
|
29 (65.9)
|
25 (67.6)
|
312 (71.6)
|
21 (38.9)
|
92 (46.2)
|
69 (38.3)
|
|
Family History of Gout
|
All
|
All
|
All
|
All
|
71 (7.3)
|
21 (10.0)
|
408 (53.4)
|
298 (48.1)
|
244 (23.6)
|
19 (15.3)
|
35 (79.5)
|
71 (89.9)
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
56 (13.7)
|
15 (12.3)
|
42 (17.4)
|
58 (15.2)
|
8 (6.5)
|
2 (7.1)
|
3 (6.8)
|
4 (10.8)
|
54 (12.4)
|
2 (3.7)
|
26 (13.1)
|
23 (12.8)
|
|
No. Relatives w/ Gout
|
All
|
All
|
All
|
All
|
282 (28.8)
|
64 (30.5)
|
531 (69.5)
|
459 (74.2)
|
643 (62.3)
|
84 (67.7)
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
104 (25.5)
|
30 (24.6)
|
65 (27.0)
|
88 (23.0)
|
36 (29.0)
|
8 (28.6)
|
17 (38.6)
|
10 (27.0)
|
84 (19.3)
|
13 (24.1)
|
45 (22.6)
|
46 (25.6)
|
|
Note:
|
|
‘All’ = all missing, ‘None’ = none missing
|
Based on this new missing data table, the following observations can be made:
Most cohorts have at minimum 50 individuals within them, though all Ardea Ironwood studies and the East Polynesian Ngati Porou gout cohort have less than 30 females, suggesting these cohorts will have limited power for study of gout phenotypes.
No individuals are missing any age at recruitment measurements.
Among all gout cohorts, serum urate is missing in less than 7% of individuals. Control cohorts have slightly higher missing data rates for this variable. This variable should therefore be suitable for use in analyses.
Among all gout cohorts, data for current ULT usage is missing in up to 35% of individuals, with very little missing data among Ardea cohorts and the UK Biobank cohort. The exceptions are the Aus/NZ European gout cohorts at close to 40% missing data, and the CRYSTAL cohort with around 50% missing data.
Among gout cohorts (excluding UK Biobank), data for age at gout onset and disease duration are missing in around 10% or fewer individuals for all cohorts.
Among gout cohorts (excluding UK Biobank), data for flare frequency is missing in up to ~15% of individuals.
Among gout cohorts (excluding UK Biobank), data for tophaceous disease is missing at very low rates among Ardea cohorts, but is missing in around 20% of individuals among Aus/NZ European cohorts and Polynesian cohorts. Importantly, it is missing in close to 40% of individuals in the GlobalGout cohort.
The gout PRS has no missing data in all cohorts.
The urate PRS is missing at around 5% in all cohorts, though it is missing at close to 33% in the UK Biobank. Comparisons with the gout PRS should be carefully done to ensure the cohort size is consistent.
Data on prophylaxis, comorbidities, lifestyle factors, and family history of gout are all inconsistently phenotyped across the cohorts and thus should not be used in analyses (with the exception of BMI which is missing at low rates in all cohorts).
Given the relatively low proportions of missingness for the most important variables for this study, we can use complete cases only for each model. This will of course reduce power, and may introduce bias which I need to be okay with as a limitation.
Characteristics of each Cohort
# Making table with statistics describing the distribution of each variable for each cohort (aftering filtering for missing data).
table1 <- tibble("Cohort" = cohortstring,
"N" = unlist(lapply(data_list2, function(x) format(nrow(x), big.mark = ","))),
"Age at Recruitment (years)" = unlist(lapply(data_list2, function(x) report(x$AGECOL))),
"Serum Urate (mg/dL)" = unlist(lapply(data_list2, function(x) report(x$URATE))),
"ULT" = unlist(lapply(data_list2, function(x) sumreport(x$ULT))),
"Age at Onset (years)" = unlist(lapply(data_list2, function(x) report(x$AGE1ATK))),
"Disease Duration (years)" = unlist(lapply(data_list2, function(x) report(x$DURATION))),
"Number of Flares in Last Year" = unlist(lapply(data_list2, function(x) report_median(x$NUMATK))),
"Tophaceous Disease" = unlist(lapply(data_list2, function(x) sumreport(x$TOPHIGOUT))),
"PRS" = unlist(lapply(data_list2, function(x) report(x$PRS))),
"Urate PRS" = unlist(lapply(data_list2, function(x) report(x$Urate_PRS))),
"Prophylaxis" = unlist(lapply(data_list2, function(x) sumreport(x$PROPHY))),
"BMI" = unlist(lapply(data_list2, function(x) report(x$BMI))),
"Hypertension" = unlist(lapply(data_list2, function(x) sumreport(x$HYPERTENSION))),
"Type 2 Diabetes" = unlist(lapply(data_list2, function(x) sumreport(x$DIABETES))),
"Heart Disease" = unlist(lapply(data_list2, function(x) sumreport(x$HEART))),
"Kidney Disease" = unlist(lapply(data_list2, function(x) sumreport(x$KIDNEY))),
"Dyslipidemia" = unlist(lapply(data_list2, function(x) sumreport(x$LIPIDS))),
"Stroke" = unlist(lapply(data_list2, function(x) sumreport(x$STROKE))),
"Alcoholic Drinks / Week" = unlist(lapply(data_list2, function(x) report(x$TOTALALC))),
"Sugar-Sweetened Drinks / Week" = unlist(lapply(data_list2, function(x) report(x$SUGDRINK))),
"Current Smoker" = unlist(lapply(data_list2, function(x) sumreport(x$CURSMOKE))),
"Family History of Gout" = unlist(lapply(data_list2, function(x) sumreport(x$FAMGOUT))),
"No. Relatives w/ Gout" = unlist(lapply(data_list2, function(x) report(x$FAMGOUTNUM))))
# Tranposing the table.
table1 <- transpose_df(table1) %>%
column_to_rownames(var = "Cohort") %>%
mutate(across(.cols = 1:ncol(table1), ~ str_replace(string = .x, pattern = " ", replacement = " ")))
# Further preparing the table for printing.
row.names(table1) <- str_replace(row.names(table1), " ", " ")
# Printing the table.
table1 %>%
kable(col.names = clean_names,
align = "c",
escape = F) %>%
kable_styling("striped") %>%
scroll_box(width = "800px", height = "490px") %>%
footnote("Flares reported as median (inter-quartile range). All other numeric variables reported as mean ± sd. All binary variables reported as N (%).")
|
|
UK Biobank Gout Male
|
UK Biobank Gout Female
|
UK Biobank Control Male
|
UK Biobank Control Female
|
Aus/NZ European Gout Male
|
Aus/NZ European Gout Female
|
Aus/NZ European Control Male
|
Aus/NZ European Control Female
|
GlobalGout Gout Male
|
GlobalGout Gout Female
|
GlobalGout Control Male
|
GlobalGout Control Female
|
Ardea LASSO Gout Male
|
Ardea LASSO Gout Female
|
Ardea CLEAR1 Gout Male
|
Ardea CLEAR1 Gout Female
|
Ardea CLEAR2 Gout Male
|
Ardea CLEAR2 Gout Female
|
Ardea CRYSTAL Gout Male
|
Ardea CRYSTAL Gout Female
|
Ardea LIGHT Gout Male
|
Ardea LIGHT Gout Female
|
East Polynesian Gout Male
|
East Polynesian Gout Female
|
East Polynesian Control Male
|
East Polynesian Control Female
|
East Polynesian Gout Male NP
|
East Polynesian Gout Female NP
|
East Polynesian Control Male NP
|
East Polynesian Control Female NP
|
West Polynesian Gout Male
|
West Polynesian Gout Female
|
West Polynesian Control Male
|
West Polynesian Control Female
|
|
N
|
6,210
|
522
|
143,574
|
162,244
|
978
|
210
|
764
|
619
|
1,032
|
124
|
44
|
79
|
819
|
65
|
230
|
16
|
239
|
7
|
175
|
4
|
109
|
8
|
408
|
122
|
241
|
382
|
124
|
28
|
44
|
37
|
436
|
54
|
199
|
180
|
|
Age at Recruitment (years)
|
59.7 ± 7.0
|
61.8 ± 6.0
|
57.2 ± 8.1
|
57.0 ± 7.8
|
62.4 ± 12.4
|
70.0 ± 12.7
|
54.9 ± 17.1
|
51.3 ± 17.3
|
60.1 ± 13.1
|
67.6 ± 11.1
|
60.0 ± 14.8
|
64.2 ± 11.5
|
51.4 ± 11.8
|
60.7 ± 10.6
|
52.3 ± 11.1
|
61.4 ± 7.4
|
53.0 ± 10.8
|
55.0 ± 16.0
|
53.9 ± 11.0
|
63.8 ± 5.4
|
53.3 ± 11.8
|
64.0 ± 16.1
|
54.3 ± 12.4
|
60.7 ± 11.7
|
43.9 ± 15.6
|
45.6 ± 14.8
|
59.7 ± 11.3
|
59.1 ± 13.3
|
49.8 ± 13.5
|
48.7 ± 17.1
|
47.5 ± 12.3
|
53.4 ± 13.4
|
39.3 ± 15.0
|
40.5 ± 15.4
|
|
Serum Urate (mg/dL)
|
6.4 ± 1.7
|
5.5 ± 1.7
|
5.9 ± 1.2
|
4.5 ± 1.1
|
6.7 ± 1.9
|
6.4 ± 2.3
|
6.4 ± 1.7
|
4.6 ± 1.6
|
7.4 ± 2.3
|
7.7 ± 2.6
|
6.6 ± 1.7
|
6.6 ± 1.7
|
8.9 ± 1.3
|
8.9 ± 1.4
|
7.9 ± 1.4
|
8.1 ± 1.2
|
7.9 ± 1.5
|
8.4 ± 2.0
|
8.8 ± 1.5
|
10.1 ± 1.2
|
9.3 ± 1.7
|
8.1 ± 1.4
|
7.0 ± 2.3
|
6.3 ± 2.5
|
6.5 ± 1.9
|
5.4 ± 1.5
|
7.0 ± 1.7
|
6.9 ± 2.4
|
6.3 ± 1.3
|
5.3 ± 1.3
|
7.7 ± 2.1
|
7.0 ± 2.7
|
6.7 ± 1.6
|
5.4 ± 1.7
|
|
ULT
|
4499 (72.4)
|
372 (71.3)
|
NA
|
NA
|
564 (96.6)
|
108 (95.6)
|
NA
|
NA
|
570 (75.7)
|
49 (59.8)
|
NA
|
NA
|
255 (31.2)
|
26 (40.6)
|
230 (100.0)
|
16 (100.0)
|
239 (100.0)
|
7 (100.0)
|
94 (100.0)
|
1 (100.0)
|
109 (100.0)
|
8 (100.0)
|
262 (92.3)
|
86 (95.6)
|
NA
|
NA
|
96 (93.2)
|
19 (73.1)
|
NA
|
NA
|
292 (94.2)
|
38 (95.0)
|
NA
|
NA
|
|
Age at Onset (years)
|
NA
|
NA
|
NA
|
NA
|
46.4 ± 15.8
|
59.5 ± 15.7
|
NA
|
NA
|
46.5 ± 14.0
|
57.8 ± 12.5
|
NA
|
NA
|
41.4 ± 13.4
|
55.1 ± 12.0
|
41.9 ± 12.4
|
55.2 ± 11.2
|
42.6 ± 13.2
|
46.1 ± 20.6
|
40.1 ± 13.0
|
61.5 ± 6.2
|
42.4 ± 13.1
|
55.2 ± 17.3
|
37.9 ± 14.0
|
49.4 ± 15.4
|
NA
|
NA
|
39.1 ± 15.2
|
46.0 ± 16.8
|
NA
|
NA
|
34.6 ± 12.0
|
44.3 ± 15.0
|
NA
|
NA
|
|
Disease Duration (years)
|
NA
|
NA
|
NA
|
NA
|
16.8 ± 12.7
|
10.9 ± 10.4
|
NA
|
NA
|
14.5 ± 11.4
|
10.6 ± 9.8
|
NA
|
NA
|
11.0 ± 9.4
|
6.6 ± 8.0
|
11.4 ± 9.4
|
7.1 ± 9.5
|
11.4 ± 9.8
|
9.9 ± 11.6
|
14.8 ± 10.0
|
3.2 ± 1.0
|
11.9 ± 8.7
|
9.8 ± 11.0
|
17.2 ± 12.8
|
13.1 ± 13.2
|
NA
|
NA
|
21.7 ± 15.3
|
14.1 ± 12.6
|
NA
|
NA
|
13.6 ± 10.3
|
9.2 ± 9.2
|
NA
|
NA
|
|
Number of Flares in Last Year
|
NA
|
NA
|
NA
|
NA
|
2 (0 - 4)
|
1.5 (0 - 3.25)
|
NA
|
NA
|
2 (1 - 4)
|
2.5 (1 - 4)
|
NA
|
NA
|
4 (3 - 8)
|
3 (3 - 6)
|
3 (2 - 6)
|
3 (3 - 4)
|
4 (2 - 8)
|
5 (3 - 6)
|
4 (3 - 6)
|
4.5 (2.25 - 6)
|
4 (2 - 10)
|
4 (2.75 - 5.25)
|
3 (1 - 6)
|
2 (0 - 5)
|
NA
|
NA
|
2 (0 - 3)
|
3 (1 - 6)
|
NA
|
NA
|
4 (2 - 10)
|
2 (1 - 5)
|
NA
|
NA
|
|
Tophaceous Disease
|
NA
|
NA
|
NA
|
NA
|
333 (43.4)
|
67 (39.9)
|
NA
|
NA
|
320 (57.6)
|
46 (62.2)
|
NA
|
NA
|
138 (16.8)
|
5 (7.7)
|
34 (14.9)
|
1 (6.2)
|
54 (22.6)
|
2 (28.6)
|
174 (99.4)
|
4 (100.0)
|
26 (23.9)
|
5 (62.5)
|
144 (41.3)
|
26 (28.3)
|
NA
|
NA
|
9 (12.2)
|
4 (19.0)
|
NA
|
NA
|
177 (44.6)
|
14 (28.6)
|
NA
|
NA
|
|
PRS
|
4.1 ± 0.6
|
4.0 ± 0.7
|
3.7 ± 0.6
|
3.7 ± 0.6
|
4.1 ± 0.7
|
4.0 ± 0.6
|
3.7 ± 0.6
|
3.7 ± 0.6
|
4.0 ± 0.6
|
4.0 ± 0.6
|
3.8 ± 0.6
|
3.8 ± 0.6
|
4.1 ± 0.7
|
4.1 ± 0.6
|
4.2 ± 0.7
|
4.3 ± 0.6
|
4.2 ± 0.6
|
4.3 ± 0.8
|
4.2 ± 0.6
|
4.0 ± 0.4
|
4.1 ± 0.6
|
4.2 ± 0.2
|
4.4 ± 0.5
|
4.4 ± 0.5
|
4.2 ± 0.4
|
4.2 ± 0.5
|
4.2 ± 0.5
|
4.4 ± 0.5
|
4.2 ± 0.5
|
4.1 ± 0.6
|
4.8 ± 0.6
|
4.7 ± 0.6
|
4.3 ± 0.6
|
4.3 ± 0.6
|
|
Urate PRS
|
3.8 ± 0.3
|
3.8 ± 0.3
|
3.6 ± 0.3
|
3.6 ± 0.3
|
3.8 ± 0.3
|
3.8 ± 0.3
|
3.6 ± 0.3
|
3.6 ± 0.3
|
3.8 ± 0.3
|
3.7 ± 0.3
|
3.6 ± 0.3
|
3.7 ± 0.3
|
3.8 ± 0.3
|
3.8 ± 0.3
|
3.8 ± 0.3
|
3.9 ± 0.3
|
3.8 ± 0.3
|
4.0 ± 0.3
|
3.9 ± 0.3
|
3.9 ± 0.1
|
3.8 ± 0.3
|
3.9 ± 0.1
|
3.7 ± 0.3
|
3.7 ± 0.3
|
3.7 ± 0.3
|
3.7 ± 0.3
|
3.7 ± 0.3
|
3.8 ± 0.3
|
3.6 ± 0.2
|
3.7 ± 0.3
|
3.8 ± 0.3
|
3.9 ± 0.4
|
3.6 ± 0.3
|
3.6 ± 0.3
|
|
Prophylaxis
|
NA
|
NA
|
NA
|
NA
|
56 (93.3)
|
13 (92.9)
|
0 (0.0)
|
NA
|
443 (62.0)
|
55 (67.9)
|
NA
|
NA
|
810 (98.9)
|
65 (100.0)
|
230 (100.0)
|
16 (100.0)
|
239 (100.0)
|
7 (100.0)
|
175 (100.0)
|
4 (100.0)
|
109 (100.0)
|
8 (100.0)
|
59 (98.3)
|
13 (92.9)
|
1 (100.0)
|
NA
|
112 (90.3)
|
23 (88.5)
|
1 (100.0)
|
NA
|
46 (100.0)
|
7 (77.8)
|
NA
|
0 (0.0)
|
|
BMI
|
30.5 ± 4.8
|
32.1 ± 6.4
|
27.6 ± 4.1
|
26.8 ± 5.0
|
30.3 ± 5.2
|
30.9 ± 7.3
|
27.2 ± 4.7
|
27.0 ± 6.2
|
29.4 ± 4.7
|
30.9 ± 6.7
|
NA
|
NA
|
34.1 ± 6.7
|
38.0 ± 10.3
|
34.6 ± 6.1
|
38.1 ± 6.5
|
33.7 ± 6.0
|
36.2 ± 7.5
|
32.2 ± 5.4
|
36.5 ± 3.8
|
31.3 ± 4.9
|
35.7 ± 8.7
|
35.4 ± 8.0
|
38.2 ± 9.8
|
31.9 ± 7.1
|
32.7 ± 8.5
|
35.9 ± 7.7
|
39.5 ± 7.6
|
32.5 ± 7.8
|
29.1 ± 6.1
|
36.1 ± 6.7
|
38.5 ± 9.1
|
33.1 ± 6.2
|
34.3 ± 7.7
|
|
Hypertension
|
4242 (68.3)
|
413 (79.1)
|
56792 (39.6)
|
48708 (30.0)
|
573 (96.3)
|
172 (98.9)
|
169 (60.8)
|
121 (40.7)
|
662 (99.3)
|
89 (100.0)
|
NA
|
NA
|
401 (49.0)
|
48 (73.8)
|
146 (63.8)
|
15 (93.8)
|
166 (69.5)
|
6 (85.7)
|
101 (57.7)
|
4 (100.0)
|
57 (52.3)
|
8 (100.0)
|
267 (92.7)
|
107 (99.1)
|
69 (98.6)
|
119 (91.5)
|
94 (100.0)
|
25 (100.0)
|
12 (100.0)
|
9 (100.0)
|
212 (88.3)
|
39 (95.1)
|
36 (87.8)
|
43 (95.6)
|
|
Type 2 Diabetes
|
1090 (19.6)
|
119 (25.4)
|
11226 (8.8)
|
6729 (4.8)
|
144 (15.9)
|
51 (26.4)
|
55 (12.2)
|
48 (13.6)
|
350 (41.1)
|
53 (55.8)
|
NA
|
NA
|
79 (9.6)
|
16 (24.6)
|
30 (13.0)
|
6 (37.5)
|
32 (13.4)
|
0 (0.0)
|
24 (13.7)
|
2 (50.0)
|
12 (11.0)
|
0 (0.0)
|
121 (30.5)
|
61 (50.4)
|
53 (22.8)
|
77 (21.2)
|
38 (100.0)
|
10 (100.0)
|
7 (100.0)
|
1 (100.0)
|
80 (19.0)
|
26 (48.1)
|
33 (17.1)
|
44 (25.6)
|
|
Heart Disease
|
1441 (23.2)
|
137 (26.2)
|
19949 (13.9)
|
9997 (6.2)
|
320 (41.6)
|
92 (51.7)
|
84 (21.8)
|
37 (11.3)
|
159 (30.9)
|
39 (54.2)
|
NA
|
NA
|
40 (4.9)
|
3 (4.6)
|
16 (7.0)
|
0 (0.0)
|
25 (10.5)
|
0 (0.0)
|
17 (9.7)
|
0 (0.0)
|
5 (4.6)
|
0 (0.0)
|
139 (38.3)
|
65 (58.6)
|
45 (21.1)
|
47 (14.6)
|
40 (51.3)
|
13 (61.9)
|
4 (26.7)
|
2 (15.4)
|
77 (19.2)
|
17 (34.0)
|
12 (7.8)
|
18 (10.8)
|
|
Kidney Disease
|
731 (12.3)
|
134 (27.0)
|
4755 (3.5)
|
7077 (4.6)
|
359 (47.5)
|
122 (73.1)
|
214 (56.2)
|
251 (70.1)
|
238 (45.8)
|
64 (80.0)
|
NA
|
NA
|
138 (17.0)
|
28 (43.1)
|
28 (12.3)
|
9 (56.2)
|
38 (15.9)
|
4 (57.1)
|
31 (17.7)
|
2 (50.0)
|
13 (11.9)
|
4 (50.0)
|
153 (46.5)
|
71 (73.2)
|
34 (68.0)
|
57 (77.0)
|
28 (93.3)
|
12 (100.0)
|
4 (66.7)
|
8 (100.0)
|
130 (35.8)
|
32 (71.1)
|
21 (53.8)
|
30 (76.9)
|
|
Dyslipidemia
|
3102 (50.0)
|
276 (52.9)
|
41243 (28.7)
|
27809 (17.1)
|
494 (82.1)
|
117 (84.2)
|
169 (51.8)
|
143 (35.3)
|
563 (72.5)
|
71 (79.8)
|
NA
|
NA
|
331 (40.4)
|
37 (56.9)
|
110 (47.8)
|
13 (81.2)
|
98 (41.0)
|
4 (57.1)
|
78 (44.8)
|
3 (75.0)
|
41 (37.6)
|
5 (62.5)
|
235 (87.4)
|
77 (92.8)
|
56 (80.0)
|
90 (79.6)
|
71 (100.0)
|
21 (100.0)
|
16 (100.0)
|
8 (100.0)
|
238 (83.2)
|
36 (90.0)
|
38 (73.1)
|
37 (56.9)
|
|
Stroke
|
410 (6.6)
|
56 (10.7)
|
5885 (4.1)
|
4062 (2.5)
|
48 (7.1)
|
16 (10.8)
|
138 (27.8)
|
214 (40.5)
|
41 (9.6)
|
8 (15.7)
|
NA
|
NA
|
7 (0.9)
|
1 (1.5)
|
3 (1.3)
|
0 (0.0)
|
1 (0.4)
|
0 (0.0)
|
3 (1.7)
|
0 (0.0)
|
0 (0.0)
|
0 (0.0)
|
24 (7.0)
|
13 (12.3)
|
12 (5.7)
|
19 (5.7)
|
2 (3.4)
|
1 (5.9)
|
0 (0.0)
|
0 (0.0)
|
12 (3.1)
|
5 (10.2)
|
4 (2.6)
|
3 (1.8)
|
|
Alcoholic Drinks / Week
|
7.6 ± 5.4
|
6.0 ± 5.1
|
6.6 ± 5.2
|
5.8 ± 4.9
|
7.8 ± 10.5
|
2.4 ± 5.1
|
4.8 ± 9.6
|
2.7 ± 4.2
|
14.1 ± 19.2
|
4.4 ± 7.5
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
5.5 ± 14.3
|
1.9 ± 7.2
|
5.3 ± 10.9
|
2.4 ± 5.8
|
5.9 ± 8.6
|
2.2 ± 4.4
|
3.2 ± 6.0
|
3.8 ± 7.6
|
4.2 ± 9.0
|
0.9 ± 2.7
|
4.3 ± 11.0
|
1.2 ± 3.7
|
|
Sugar-Sweetened Drinks / Week
|
NA
|
NA
|
NA
|
NA
|
1.0 ± 1.5
|
0.6 ± 1.1
|
0.9 ± 1.3
|
0.5 ± 1.1
|
0.8 ± 1.3
|
0.7 ± 1.2
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
1.7 ± 1.9
|
1.0 ± 1.5
|
1.8 ± 2.5
|
1.2 ± 1.6
|
1.3 ± 1.7
|
0.8 ± 1.3
|
2.1 ± 2.2
|
1.4 ± 2.0
|
2.3 ± 2.2
|
1.4 ± 1.5
|
2.0 ± 1.7
|
1.4 ± 1.5
|
|
Current Smoker
|
349 (5.6)
|
38 (7.3)
|
12298 (8.6)
|
10641 (6.6)
|
22 (4.3)
|
5 (5.8)
|
26 (5.2)
|
20 (4.2)
|
94 (16.5)
|
14 (18.2)
|
NA
|
NA
|
NA
|
NA
|
40 (17.5)
|
0 (0.0)
|
30 (12.6)
|
0 (0.0)
|
35 (20.0)
|
0 (0.0)
|
19 (17.9)
|
0 (0.0)
|
35 (17.8)
|
4 (7.5)
|
41 (29.1)
|
45 (23.4)
|
16 (23.9)
|
3 (15.8)
|
4 (26.7)
|
4 (33.3)
|
12 (9.7)
|
2 (6.1)
|
23 (21.5)
|
15 (13.5)
|
|
Family History of Gout
|
NA
|
NA
|
NA
|
NA
|
402 (44.3)
|
89 (47.1)
|
58 (16.3)
|
76 (23.7)
|
270 (34.3)
|
40 (38.1)
|
3 (33.3)
|
2 (25.0)
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
239 (67.9)
|
80 (74.8)
|
87 (43.7)
|
148 (45.7)
|
79 (68.1)
|
22 (84.6)
|
15 (36.6)
|
19 (57.6)
|
237 (62.0)
|
32 (61.5)
|
66 (38.2)
|
61 (38.9)
|
|
No. Relatives w/ Gout
|
NA
|
NA
|
NA
|
NA
|
0.8 ± 1.0
|
1.0 ± 1.3
|
0.3 ± 0.6
|
0.5 ± 0.7
|
0.7 ± 0.8
|
0.9 ± 0.8
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
1.7 ± 2.2
|
1.8 ± 1.5
|
0.8 ± 1.3
|
0.7 ± 1.0
|
2.0 ± 1.9
|
2.2 ± 1.5
|
0.8 ± 1.0
|
1.3 ± 1.4
|
1.5 ± 1.9
|
1.5 ± 1.9
|
0.6 ± 0.9
|
0.6 ± 0.9
|
|
Note:
|
|
Flares reported as median (inter-quartile range). All other numeric variables reported as mean ± sd. All binary variables reported as N (%).
|
From this table we can make the following observations:
In general, there are around 5 to 10 times more male gout cases than female gout cases. Though the control male to female ratio is closer to 1:1.
Within each population group, gout cases are consistently older than controls by around 5 to 20 years. Male European gout cohorts are mostly 60 years old on average, except for Ardea cohorts who are younger (around 50 - 55 years old on average). Male East Polynesian gout cohorts are around 55 - 60 years old on average, while the male West Polynesian gout cases appear to be younger (average of 47 years old). Female gout cohorts are consistently older than their male counterparts, with an average age difference of around 10 years.
Serum urate levels for gout cohorts tend to be between 6.5 and 9 mg/dL, while control cohorts sit around 4 to 6 mg/dL on average. Women in general have lower serum urate than men.
Most gout cohorts have high rates of ULT (many are over 90% on ULT - though this is likely related to missing data issues). Similar ULT usage is seen in men and women, though perhaps slightly less women are prescribed ULT than men.
Mean age at gout onset is around 40 - 50 for male European cohorts and around 35 - 40 for Polynesian men. Women have a mean age at gout onset 5 to 20 years higher than men. Average disease duration is around 10 - 20 years for all male cohorts, while being about 5 to 10 years shorter for female cohorts.
Gout cohorts have up to a median of around 2 to 4 flares per year on average, with slightly higher flare rates among Ardea cohorts. There are no obvious difference in flares between men and women, nor between European and Polynesian cohorts.
Gout cohorts tend to have between 15 and 50% tophaceous gout. There are no obvious differences between men and women, nor between ancestries.
The mean gout PRS is around 4.0 to 4.3 for European gout cohorts, around 3.7 for European control cohorts, around 4.4 for East Polynesian gout cohorts, 4.8 for West Polynesian gout cohorts and around 4.2 to 4.3 for Polynesian control cohorts. There is no difference between men and women. Interestingly, the urate PRS is similar between all ancestral groups, though it is consistently slightly lower among controls versus cases.
For cohorts with prophylaxis data, almost all gout cases are on prophylaxis. No obvious sex differences. I don’t think this data is usable due to excessive missing data.
Male European gout cases have a mean BMI of 30, though male Ardea cohorts are all closer to 35. Female European gout cases have a mean BMI of 31 - 32 on average, though female Ardea cohorts are all around 37. European control cohorts typically have an average BMI of around 27/28 for women/men respectively. Male Polynesian cases have a mean BMI of ~36 while male controls have a BMI of ~32 on average. Polynesian women consistently have around 2 to 3 units higher BMI than their male counterparts, regardless of gout status.
All remaining variables cannot be interpreted due to missing data.
Based on this information, I will take the following action:
Remove any cohort with fewer than ~30 individuals (i.e. female Ardea cohorts) as they are unlikely to contribute to the meta-analyses (due to inverse variance weighting).
Ignore any phenotypes that are not phenotyped at more than 80% across all cohorts (excluding UK Biobank), given that there are likely biases as to inclusion in phenotyping (i.e. negative ULT cases reported as missing in some cohorts).
Plotting distribution of each variable
# Creating table for plotting the distributions of all variables of interest in each cohort, stratified by sex.
all_cohorts <- all_pheno_prs2 %>%
mutate(SEX = factor(SEX, levels = c("Male", "Female")),
GOUT2 = factor(GOUT, levels = c(TRUE, FALSE), labels = c("Gout", "Control")),
ANCESTRY_GOUT = factor(case_when(GOUT & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "European Gout",
!GOUT & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "European Control",
GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study != "Ngati Porou" ~ "East Polynesian Gout",
!GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study != "Ngati Porou" ~ "East Polynesian Control",
GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study == "Ngati Porou" ~ "East Polynesian Gout - NP",
!GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study == "Ngati Porou" ~ "East Polynesian Control - NP",
GOUT & Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan") ~ "West Polynesian Gout",
!GOUT & Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan") ~ "West Polynesian Control"),
levels = c("European Gout",
"European Control",
"East Polynesian Gout",
"East Polynesian Control",
"East Polynesian Gout - NP",
"East Polynesian Control - NP",
"West Polynesian Gout",
"West Polynesian Control")),
ANCESTRY_GOUT_SEX = factor(case_when(ANCESTRY_GOUT == "European Gout" & SEX == "Male" ~ "European Gout - male",
ANCESTRY_GOUT == "European Gout" & SEX == "Female" ~ "European Gout - female",
ANCESTRY_GOUT == "European Control" & SEX == "Male" ~ "European Control - male",
ANCESTRY_GOUT == "European Control" & SEX == "Female" ~ "European Control - female",
ANCESTRY_GOUT == "East Polynesian Gout" & SEX == "Male" ~ "East Polynesian Gout - male",
ANCESTRY_GOUT == "East Polynesian Gout" & SEX == "Female" ~ "East Polynesian Gout - female",
ANCESTRY_GOUT == "East Polynesian Control" & SEX == "Male" ~ "East Polynesian Control - male",
ANCESTRY_GOUT == "East Polynesian Control" & SEX == "Female" ~ "East Polynesian Control - female",
ANCESTRY_GOUT == "East Polynesian Gout - NP" & SEX == "Male" ~ "East Polynesian Gout - NP - male",
ANCESTRY_GOUT == "East Polynesian Gout - NP" & SEX == "Female" ~ "East Polynesian Gout - NP - female",
ANCESTRY_GOUT == "East Polynesian Control - NP" & SEX == "Male" ~ "East Polynesian Control - NP - male",
ANCESTRY_GOUT == "East Polynesian Control - NP" & SEX == "Female" ~ "East Polynesian Control - NP - female",
ANCESTRY_GOUT == "West Polynesian Gout" & SEX == "Male" ~ "West Polynesian Gout - male",
ANCESTRY_GOUT == "West Polynesian Gout" & SEX == "Female" ~ "West Polynesian Gout - female",
ANCESTRY_GOUT == "West Polynesian Control" & SEX == "Male" ~ "West Polynesian Control - male",
ANCESTRY_GOUT == "West Polynesian Control" & SEX == "Female" ~ "West Polynesian Control - female"),
levels = c("European Gout - male",
"European Gout - female",
"European Control - male",
"European Control - female",
"East Polynesian Gout - male",
"East Polynesian Gout - female",
"East Polynesian Control - male",
"East Polynesian Control - female",
"East Polynesian Gout - NP - male",
"East Polynesian Gout - NP - female",
"East Polynesian Control - NP - male",
"East Polynesian Control - NP - female",
"West Polynesian Gout - male",
"West Polynesian Gout - female",
"West Polynesian Control - male",
"West Polynesian Control - female")),
SEX_GOUT = factor(case_when(GOUT & SEX == "Male" ~ "Male Gout",
GOUT & SEX == "Female" ~ "Female Gout",
!GOUT & SEX == "Male" ~ "Male Control",
!GOUT & SEX == "Female" ~ "Female Control"),
levels = c("Male Gout",
"Female Gout",
"Male Control",
"Female Control")),
COHORT_GOUT = factor(case_when(GOUT & Pheno.Study == "UK Biobank" ~ "UK Biobank - Gout",
!GOUT & Pheno.Study == "UK Biobank" ~ "UK Biobank - Control",
GOUT & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") & Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease") ~ "Aus/NZ - Gout",
!GOUT & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") & Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease") ~ "Aus/NZ - Control",
GOUT & Pheno.Study == "EuroGout" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "GlobalGout - Gout",
!GOUT & Pheno.Study == "EuroGout" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "GlobalGout - Control",
Pheno.Study == "Ardea: 401" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "Ardea - LASSO",
Pheno.Study == "Ardea: CLEAR1" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "Ardea - CLEAR1",
Pheno.Study == "Ardea: CLEAR2" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "Ardea - CLEAR2",
Pheno.Study == "Ardea: CRYSTAL" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "Ardea - CRYSTAL",
Pheno.Study == "Ardea: LIGHT" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "Ardea - LIGHT",
GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study != "Ngati Porou" ~ "East Polynesian - Gout",
!GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study != "Ngati Porou" ~ "East Polynesian - Control",
GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study == "Ngati Porou" ~ "East Polynesian - Gout - NP",
!GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study == "Ngati Porou" ~ "East Polynesian - Control - NP",
GOUT & Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan") ~ "West Polynesian - Gout",
!GOUT & Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan") ~ "West Polynesian - Control"),
levels = c("UK Biobank - Gout",
"UK Biobank - Control",
"Aus/NZ - Gout",
"Aus/NZ - Control",
"GlobalGout - Gout",
"GlobalGout - Control",
"Ardea - LASSO",
"Ardea - CLEAR1",
"Ardea - CLEAR2",
"Ardea - CRYSTAL",
"Ardea - LIGHT",
"East Polynesian - Gout",
"East Polynesian - Control",
"East Polynesian - Gout - NP",
"East Polynesian - Control - NP",
"West Polynesian - Gout",
"West Polynesian - Control"))) %>%
filter(!is.na(COHORT_GOUT))
Age at Recruitment
# Plotting distribution of age at recruitment in each cohort, stratified by sex.
all_cohorts %>%
ggplot(aes(x = AGECOL, y = COHORT_GOUT, color = SEX)) +
geom_boxplot(position = position_dodge2(reverse = T)) +
labs(x = "Age at Recruitment (years)") +
theme(axis.title.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank()) +
scale_x_continuous(limits = c(0, 100)) +
scale_y_discrete(limits = rev(levels(all_cohorts$COHORT_GOUT)))

Serum Urate
# Plotting distribution of serum urate in each cohort, stratified by sex.
all_cohorts %>%
filter(!is.na(URATE)) %>%
ggplot(aes(x = URATE, y = COHORT_GOUT, color = SEX)) +
geom_boxplot(position = position_dodge2(reverse = T)) +
labs(x = "Serum Urate (mg/dL)") +
theme(axis.title.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank()) +
scale_x_continuous(limits = c(0, max(all_cohorts$URATE, na.rm = T))) +
scale_y_discrete(limits = rev(levels(all_cohorts$COHORT_GOUT)))

Urate-Lowering Therapy
# Plotting distribution of ULT usage in each cohort, stratified by sex.
all_cohorts %>%
filter(GOUT) %>%
mutate(ULT = factor(case_when(ULT ~ "On ULT",
!ULT ~ "Not on ULT",
is.na(ULT) ~ "No Data"),
levels = c("No Data", "Not on ULT", "On ULT"))) %>%
group_by(COHORT_GOUT, ULT, SEX) %>%
summarize(value = n()) %>%
ggplot(aes(x = COHORT_GOUT, y = value, fill = ULT, color = COHORT_GOUT)) +
geom_bar(position = "fill", stat = "identity") +
facet_wrap(~ SEX) +
scale_fill_discrete(type = c("black", "#C0C0C0", "#505050"), limits = c("On ULT", "Not on ULT", "No Data")) +
theme(axis.title.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank())

Age at Gout Onset
# Plotting distribution of age at onset in each cohort, stratified by sex.
all_cohorts %>%
filter(GOUT, !is.na(AGE1ATK)) %>%
ggplot(aes(x = AGE1ATK, y = COHORT_GOUT, color = SEX)) +
geom_boxplot(position = position_dodge2(reverse = T)) +
labs(x = "Age at Onset (years)") +
theme(axis.title.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank()) +
scale_x_continuous(limits = c(0, 100)) +
scale_y_discrete(limits = rev(levels(all_cohorts %>% filter(GOUT, !is.na(AGE1ATK)) %>% mutate(COHORT_GOUT = factor(COHORT_GOUT)) %>% pull(COHORT_GOUT))))

Age at Gout Onset 2
# Plotting distribution of age at onset in males in each cohort, stratified by ancestry.
all_cohorts %>%
filter(GOUT, !is.na(AGE1ATK), SEX == "Male") %>%
mutate(ANCESTRY_GOUT = case_when(ANCESTRY_GOUT == "East Polynesian Gout - NP" ~ "East Polynesian Gout",
TRUE ~ as.character(ANCESTRY_GOUT))) %>%
ggplot(aes(x = AGE1ATK, y = ..density.., color = COHORT_GOUT)) +
geom_freqpoly(bins = 30) +
labs(x = "Age at Onset (years)") +
theme(axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank()) +
scale_x_continuous(limits = c(0, 100)) +
facet_wrap(~ ANCESTRY_GOUT,
ncol = 1)

Disease Duration
# Plotting distribution of age at onset in each cohort, stratified by sex.
all_cohorts %>%
filter(GOUT, !is.na(DURATION)) %>%
ggplot(aes(x = DURATION, y = COHORT_GOUT, color = SEX)) +
geom_boxplot(position = position_dodge2(reverse = T)) +
labs(x = "Disease Duration (years)") +
theme(axis.title.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank()) +
scale_x_continuous(limits = c(0, max(all_cohorts$DURATION, na.rm = T))) +
scale_y_discrete(limits = rev(levels(all_cohorts %>% filter(GOUT, !is.na(DURATION)) %>% mutate(COHORT_GOUT = factor(COHORT_GOUT)) %>% pull(COHORT_GOUT))))

Flare Frequency
# Plotting distribution of flare frequency in each cohort, stratified by sex.
all_cohorts %>%
filter(GOUT, !is.na(NUMATK)) %>%
mutate(NUMATK = case_when(NUMATK > 52 ~ 52,
TRUE ~ NUMATK)) %>%
ggplot(aes(x = NUMATK, y = COHORT_GOUT, color = SEX)) +
geom_boxplot(position = position_dodge2(reverse = T)) +
labs(x = "Number of Flares in Last Year") +
theme(axis.title.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank()) +
scale_x_continuous(limits = c(0, 52)) +
scale_y_discrete(limits = rev(levels(all_cohorts %>% filter(GOUT, !is.na(NUMATK)) %>% mutate(COHORT_GOUT = factor(COHORT_GOUT)) %>% pull(COHORT_GOUT))))

Flare Frequency (categorical)
# Plotting distribution of flare frequency as a categorical variable in each cohort, stratified by sex.
all_cohorts %>%
filter(GOUT,
COHORT_GOUT != "UK Biobank - Gout") %>%
mutate(FLARE_CAT = factor(case_when(is.na(FLARE_CAT) ~ "No Data",
TRUE ~ as.character(FLARE_CAT)),
levels = rev(c(paste0(0:5),
"6 - 11",
"12 - 52",
"No Data")),
ordered = TRUE)) %>%
group_by(COHORT_GOUT, FLARE_CAT, SEX) %>%
summarize(value = n()) %>%
ggplot(aes(x = COHORT_GOUT, y = value, fill = FLARE_CAT, color = COHORT_GOUT)) +
geom_bar(position = "fill", stat = "identity") +
facet_wrap(~ SEX) +
scale_fill_discrete(type = c("#C0C0C0", "#FDE725FF", "#9FDA3AFF", "#4AC16DFF", "#1FA187FF", "#277F8EFF", "#365C8DFF", "#46337EFF", "#440154FF")) +
theme(axis.title.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank())

Tophaceous Disease
# Plotting distribution of tophaceous disease in each cohort, stratified by sex.
all_cohorts %>%
filter(GOUT,
COHORT_GOUT != "UK Biobank - Gout") %>%
mutate(TOPHIGOUT = factor(case_when(TOPHIGOUT ~ "Tophi",
!TOPHIGOUT ~ "No Tophi",
is.na(TOPHIGOUT) ~ "No Data"),
levels = c("No Data", "No Tophi", "Tophi"))) %>%
group_by(COHORT_GOUT, TOPHIGOUT, SEX) %>%
summarize(value = n()) %>%
ggplot(aes(x = COHORT_GOUT, y = value, fill = TOPHIGOUT, color = COHORT_GOUT)) +
geom_bar(position = "fill", stat = "identity") +
facet_wrap(~ SEX) +
scale_fill_discrete(type = c("black", "#C0C0C0", "#505050"), limits = c("Tophi", "No Tophi", "No Data")) +
theme(axis.title.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank())

Gout PRS
# Plotting distribution of gout PRS in each cohort, stratified by sex.
all_cohorts %>%
ggplot(aes(x = PRS, y = COHORT_GOUT, color = SEX)) +
geom_boxplot(position = position_dodge2(reverse = T)) +
labs(x = "Gout PRS") +
theme(axis.title.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank()) +
scale_y_discrete(limits = rev(levels(all_cohorts$COHORT_GOUT)))

all_cohorts %>%
ggplot(aes(x = PRS2, y = COHORT_GOUT, color = SEX)) +
geom_boxplot(position = position_dodge2(reverse = T)) +
labs(x = "Gout PRS") +
theme(axis.title.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank()) +
scale_y_discrete(limits = rev(levels(all_cohorts$COHORT_GOUT)))

Urate PRS
# Plotting distribution of urate PRS in each cohort, stratified by sex.
all_cohorts %>%
ggplot(aes(x = Urate_PRS, y = COHORT_GOUT, color = SEX)) +
geom_boxplot(position = position_dodge2(reverse = T)) +
labs(x = "Urate PRS") +
theme(axis.title.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank()) +
scale_y_discrete(limits = rev(levels(all_cohorts$COHORT_GOUT)))

all_cohorts %>%
ggplot(aes(x = Urate_PRS2, y = COHORT_GOUT, color = SEX)) +
geom_boxplot(position = position_dodge2(reverse = T)) +
labs(x = "Urate PRS") +
theme(axis.title.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank()) +
scale_y_discrete(limits = rev(levels(all_cohorts$COHORT_GOUT)))

Plotting relationship between age at recruitment and each outcome of interest, stratified by sex
Serum Urate vs Age at Recruitment, stratified by ULT usage
# Plotting relationship between age at recruitment and serum urate, stratified by ULT usage.
all_cohorts %>%
filter(!is.na(URATE),
Pheno.Study != "UK Biobank") %>%
mutate(ULT2 = factor(case_when(!ULT ~ "Not on ULT",
ULT ~ "On ULT",
is.na(ULT) ~ "No Data / Control"),
levels = c("On ULT", "Not on ULT", "No Data / Control"))) %>%
ggplot(mapping = aes(x = AGECOL, y = URATE, color = ANCESTRY_GOUT)) +
geom_point(size = 0.5, alpha = 0.5, shape = 1) +
geom_smooth(method = "lm", se = F) +
facet_wrap(~ SEX * ULT2) +
labs(x = "Mean Age at Recruitment (years)",
y = "Serum Urate (mg/dL)") +
theme(legend.title = element_blank())

Serum Urate vs Disease Duration, stratified by ULT usage
# Plotting relationship between disease duration and serum urate, stratified by ULT usage.
all_cohorts %>%
filter(!is.na(URATE),
Pheno.Study != "UK Biobank",
GOUT) %>%
mutate(ULT2 = factor(case_when(!ULT ~ "Not on ULT",
ULT ~ "On ULT",
is.na(ULT) ~ "No Data"),
levels = c("On ULT", "Not on ULT", "No Data"))) %>%
ggplot(mapping = aes(x = DURATION, y = URATE, color = ANCESTRY_GOUT)) +
geom_point(size = 0.5, alpha = 0.5, shape = 1) +
geom_smooth(method = "lm", se = F) +
facet_wrap(~ SEX * ULT2) +
labs(x = "Disease Duration (years)",
y = "Serum Urate (mg/dL)") +
theme(legend.title = element_blank())

ULT Usage vs Age at Recruitment
# Plotting relationship between ULT usage and age at recruitment.
all_cohorts %>%
filter(GOUT) %>%
mutate(ULT2 = factor(case_when(!ULT ~ "Not on ULT",
ULT ~ "On ULT",
is.na(ULT) ~ "No Data"),
levels = c("On ULT", "Not on ULT", "No Data"))) %>%
ggplot(mapping = aes(x = ULT2, y = AGECOL, fill = ANCESTRY_GOUT)) +
stat_summary(fun = mean, geom = "bar", position = position_dodge()) +
stat_summary(fun.data = mean_se, geom = "errorbar", position = position_dodge(width = 0.9), width = 0.3, alpha = 0.75) +
facet_wrap(~ SEX) +
labs(y = "Mean Age at Recruitment (years)") +
theme(axis.title.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.title = element_blank())

Age at Gout Onset vs Age at Recruitment
# Plotting relationship between age at onset and age at recruitment.
all_cohorts %>%
filter(!is.na(AGE1ATK)) %>%
ggplot(mapping = aes(x = AGECOL, y = AGE1ATK, color = COHORT_GOUT)) +
geom_point(size = 0.5, alpha = 0.5, shape = 1) +
geom_smooth(method = "lm", se = F) +
facet_wrap(~ SEX) +
labs(x = "Age at Recruitment (years)", y = "Age at Onset (years)") +
theme(legend.title = element_blank())

Disease Duration vs Age at Recruitment
# Plotting relationship between disease duration and age at recruitment.
all_cohorts %>%
filter(GOUT, !is.na(AGE1ATK)) %>%
ggplot(mapping = aes(x = AGECOL, y = DURATION, color = COHORT_GOUT)) +
geom_point(size = 0.5, alpha = 0.5, shape = 1) +
geom_smooth(method = "lm", se = F) +
facet_wrap(~ SEX) +
labs(x = "Age at Recruitment (years)", y = "Disease Duration (years)") +
theme(legend.title = element_blank())

Flare Frequency vs Age at Recruitment (in all individuals)
# Plotting relationship between categorized flare frequency and age at recruitment (in all individuals with gout).
all_cohorts %>%
filter(GOUT,
!is.na(FLARE_CAT),
!str_detect(COHORT_GOUT, "Ardea")) %>%
ggplot(aes(x = FLARE_CAT, y = AGECOL, color = COHORT_GOUT)) +
stat_summary(geom = "point", fun = mean) +
stat_summary(geom = "errorbar", fun.data = mean_se, width = 0.3) +
stat_summary(geom = "line", fun = mean, aes(group = COHORT_GOUT)) +
facet_wrap(~ SEX) +
labs(x = "Number of Flares in Last Year (categorical)", y = "Age at Recruitment (years)") +
theme(legend.title = element_blank())

Flare Frequency vs Age at Recruitment (in individuals with at least 2 flares in last year)
# Plotting relationship between categorized flare frequency and age at recruitment (in individuals with at least 2 flares in last year).
all_cohorts %>%
filter(GOUT,
!is.na(FLARE_CAT),
NUMATK >= 2) %>%
ggplot(aes(x = FLARE_CAT, y = AGECOL, color = COHORT_GOUT)) +
stat_summary(geom = "point", fun = mean) +
stat_summary(geom = "errorbar", fun.data = mean_se, width = 0.3) +
stat_summary(geom = "line", fun = mean, aes(group = COHORT_GOUT)) +
facet_wrap(~ SEX) +
labs(x = "Number of Flares in Last Year (categorical)", y = "Age at Recruitment (years)") +
theme(legend.title = element_blank())

Tophaceous Disease vs Age at Recruitment
# Plotting relationship between tophaceous disease and age at recruitment
all_cohorts %>%
filter(GOUT,
!str_detect(COHORT_GOUT, "CRYSTAL")) %>%
mutate(TOPHIGOUT = factor(case_when(TOPHIGOUT ~ "Tophi",
!TOPHIGOUT ~ "No Tophi",
is.na(TOPHIGOUT) ~ "No Data"),
levels = c("Tophi", "No Tophi", "No Data"))) %>%
ggplot(mapping = aes(x = TOPHIGOUT, y = AGECOL, fill = ANCESTRY_GOUT)) +
stat_summary(fun = mean, geom = "bar", position = position_dodge()) +
stat_summary(fun.data = mean_se, geom = "errorbar", position = position_dodge(width = 0.9), width = 0.3, alpha = 0.75) +
facet_wrap(~ SEX) +
labs(y = "Mean Age at Recruitment (years)") +
theme(axis.title.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.title = element_blank())

Plotting relationship between both PRS and each outcome of interest
Gout vs Gout PRS
# Plotting the relationship between the gout PRS and gout.
all_cohorts %>%
filter(!is.na(Geno.SpecificAncestry)) %>%
ggplot(aes(x = Geno.SpecificAncestry, y = PRS, fill = GOUT2)) +
stat_summary(fun = mean, geom = "bar", position = position_dodge()) +
stat_summary(fun.data = mean_se, geom = "errorbar", position = position_dodge(width = 0.9), width = 0.3, alpha = 0.75) +
facet_wrap(~ SEX) +
labs(y = "Gout PRS") +
scale_fill_discrete(limits = c("Gout", "Control")) +
theme(axis.title.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Gout vs Urate PRS
# Plotting the relationship between the urate PRS and gout.
all_cohorts %>%
filter(!is.na(Geno.SpecificAncestry)) %>%
ggplot(aes(x = Geno.SpecificAncestry, y = Urate_PRS, fill = GOUT2)) +
stat_summary(fun = mean, geom = "bar", position = position_dodge()) +
stat_summary(fun.data = mean_se, geom = "errorbar", position = position_dodge(width = 0.9), width = 0.3, alpha = 0.75) +
facet_wrap(~ SEX) +
labs(y = "Urate PRS") +
scale_fill_discrete(limits = c("Gout", "Control")) +
theme(axis.title.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Serum urate vs Gout PRS
# Plotting the relationship between the gout PRS and serum urate.
all_cohorts %>%
filter(!is.na(URATE),
Pheno.Study != "UK Biobank") %>%
mutate(ULT2 = factor(case_when(!ULT ~ "Not on ULT",
ULT ~ "On ULT",
is.na(ULT) ~ "No Data / Control"),
levels = c("On ULT", "Not on ULT", "No Data / Control"))) %>%
ggplot(mapping = aes(x = PRS, y = URATE, color = ANCESTRY_GOUT)) +
geom_point(size = 0.5, alpha = 0.5, shape = 1) +
geom_smooth(method = "lm", se = F) +
facet_wrap(~ SEX * ULT2) +
labs(x = "Gout PRS",
y = "Serum Urate (mg/dL)") +
theme(legend.title = element_blank())

Serum urate vs Urate PRS
# Plotting the relationship between the urate PRS and serum urate.
all_cohorts %>%
filter(!is.na(URATE),
Pheno.Study != "UK Biobank") %>%
mutate(ULT2 = factor(case_when(!ULT ~ "Not on ULT",
ULT ~ "On ULT",
is.na(ULT) ~ "No Data / Control"),
levels = c("On ULT", "Not on ULT", "No Data / Control"))) %>%
ggplot(mapping = aes(x = Urate_PRS, y = URATE, color = ANCESTRY_GOUT)) +
geom_point(size = 0.5, alpha = 0.5, shape = 1) +
geom_smooth(method = "lm", se = F) +
facet_wrap(~ SEX * ULT2) +
labs(x = "Gout PRS",
y = "Serum Urate (mg/dL)") +
theme(legend.title = element_blank())

ULT vs Gout PRS
# Plotting the relationship between ULT usage and the gout PRS.
all_cohorts %>%
filter(GOUT) %>%
mutate(ULT2 = factor(case_when(!ULT ~ "Not on ULT",
ULT ~ "On ULT",
is.na(ULT) ~ "No Data"),
levels = c("On ULT", "Not on ULT", "No Data"))) %>%
ggplot(mapping = aes(x = ANCESTRY_GOUT, y = PRS, fill = ULT2)) +
stat_summary(fun = mean, geom = "bar", position = position_dodge()) +
stat_summary(fun.data = mean_se, geom = "errorbar", position = position_dodge(width = 0.9), width = 0.3, alpha = 0.75) +
facet_wrap(~ SEX) +
labs(y = "Gout PRS") +
theme(axis.title.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

ULT vs Urate PRS
# Plotting the relationship between ULT usage and the urate PRS.
all_cohorts %>%
filter(GOUT) %>%
mutate(ULT2 = factor(case_when(!ULT ~ "Not on ULT",
ULT ~ "On ULT",
is.na(ULT) ~ "No Data"),
levels = c("On ULT", "Not on ULT", "No Data"))) %>%
ggplot(mapping = aes(x = ANCESTRY_GOUT, y = Urate_PRS, fill = ULT2)) +
stat_summary(fun = mean, geom = "bar", position = position_dodge()) +
stat_summary(fun.data = mean_se, geom = "errorbar", position = position_dodge(width = 0.9), width = 0.3, alpha = 0.75) +
facet_wrap(~ SEX) +
labs(y = "Gout PRS") +
theme(axis.title.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Age at Onset vs Gout PRS
# Plotting the relationship between age at onset and the gout PRS.
all_cohorts %>%
filter(!is.na(AGE1ATK)) %>%
ggplot(mapping = aes(x = PRS, y = AGE1ATK, color = COHORT_GOUT)) +
geom_point(size = 0.5, alpha = 0.5, shape = 1) +
geom_smooth(method = "lm", se = F) +
facet_wrap(~ SEX) +
labs(x = "Gout PRS", y = "Age at Onset (years)") +
theme(legend.title = element_blank())

Age at Onset vs Urate PRS
# Plotting the relationship between age at onset and the urate PRS.
all_cohorts %>%
filter(!is.na(AGE1ATK)) %>%
ggplot(mapping = aes(x = Urate_PRS, y = AGE1ATK, color = COHORT_GOUT)) +
geom_point(size = 0.5, alpha = 0.5, shape = 1) +
geom_smooth(method = "lm", se = F) +
facet_wrap(~ SEX) +
labs(x = "Urate PRS", y = "Age at Onset (years)") +
theme(legend.title = element_blank())

Disease Duration vs Gout PRS
# Plotting the relationship between disease duration and the gout PRS.
all_cohorts %>%
filter(!is.na(DURATION)) %>%
ggplot(mapping = aes(x = PRS, y = DURATION, color = COHORT_GOUT)) +
geom_point(size = 0.5, alpha = 0.5, shape = 1) +
geom_smooth(method = "lm", se = F) +
facet_wrap(~ SEX) +
labs(x = "Gout PRS", y = "Disease Duration (years)") +
theme(legend.title = element_blank())

Disease Duration vs Urate PRS
# Plotting the relationship between disease duration and the urate PRS.
all_cohorts %>%
filter(!is.na(DURATION)) %>%
ggplot(mapping = aes(x = Urate_PRS, y = DURATION, color = COHORT_GOUT)) +
geom_point(size = 0.5, alpha = 0.5, shape = 1) +
geom_smooth(method = "lm", se = F) +
facet_wrap(~ SEX) +
labs(x = "Urate PRS", y = "Disease Duration (years)") +
theme(legend.title = element_blank())

Flare Frequency vs Gout PRS (all individuals)
# Plotting the relationship between flare frequency and the gout PRS in all individuals.
all_cohorts %>%
filter(GOUT, !is.na(FLARE_CAT),
!str_detect(COHORT_GOUT, "Ardea")) %>%
ggplot(aes(x = FLARE_CAT, y = PRS, color = COHORT_GOUT)) +
stat_summary(geom = "point", fun = mean) +
stat_summary(geom = "errorbar", fun.data = mean_se, width = 0.3) +
stat_summary(geom = "line", fun = mean, aes(group = COHORT_GOUT)) +
facet_wrap(~ SEX) +
labs(x = "Number of Flares in Last Year (categorical)", y = "Gout PRS") +
theme(legend.title = element_blank())

Flare Frequency vs Urate PRS (all individuals)
# Plotting the relationship between flare frequency and the urate PRS in all individuals.
all_cohorts %>%
filter(GOUT, !is.na(FLARE_CAT),
!str_detect(COHORT_GOUT, "Ardea")) %>%
ggplot(aes(x = FLARE_CAT, y = Urate_PRS, color = COHORT_GOUT)) +
stat_summary(geom = "point", fun = mean) +
stat_summary(geom = "errorbar", fun.data = mean_se, width = 0.3) +
stat_summary(geom = "line", fun = mean, aes(group = COHORT_GOUT)) +
facet_wrap(~ SEX) +
labs(x = "Number of Flares in Last Year (categorical)", y = "Urate PRS") +
theme(legend.title = element_blank())

Flare Frequency vs Gout PRS (> 1 flares only)
# Plotting the relationship between flare frequency and the gout PRS in all individuals with > 1 flares in the last year.
all_cohorts %>%
filter(GOUT, !is.na(FLARE_CAT),
NUMATK >= 2) %>%
ggplot(aes(x = FLARE_CAT, y = PRS, color = COHORT_GOUT)) +
stat_summary(geom = "point", fun = mean) +
stat_summary(geom = "errorbar", fun.data = mean_se, width = 0.3) +
stat_summary(geom = "line", fun = mean, aes(group = COHORT_GOUT)) +
facet_wrap(~ SEX) +
labs(x = "Number of Flares in Last Year (categorical)", y = "Gout PRS") +
theme(legend.title = element_blank())

Flare Frequency vs Urate PRS (> 1 flares only)
# Plotting the relationship between flare frequency and the urate PRS in all individuals with > 1 flares in the last year.
all_cohorts %>%
filter(GOUT, !is.na(FLARE_CAT),
NUMATK >= 2) %>%
ggplot(aes(x = FLARE_CAT, y = Urate_PRS, color = COHORT_GOUT)) +
stat_summary(geom = "point", fun = mean) +
stat_summary(geom = "errorbar", fun.data = mean_se, width = 0.3) +
stat_summary(geom = "line", fun = mean, aes(group = COHORT_GOUT)) +
facet_wrap(~ SEX) +
labs(x = "Number of Flares in Last Year (categorical)", y = "Urate PRS") +
theme(legend.title = element_blank())

Tophaceous Disease vs Gout PRS
# Plotting the relationship between tophaceous disease and the gout PRS.
all_cohorts %>%
filter(GOUT,
!str_detect(COHORT_GOUT, "CRYSTAL")) %>%
mutate(TOPHIGOUT = factor(case_when(TOPHIGOUT ~ "Tophi",
!TOPHIGOUT ~ "No Tophi",
is.na(TOPHIGOUT) ~ "No Data"),
levels = c("Tophi", "No Tophi", "No Data"))) %>%
ggplot(mapping = aes(x = ANCESTRY_GOUT, y = PRS, fill = TOPHIGOUT)) +
stat_summary(fun = mean, geom = "bar", position = position_dodge()) +
stat_summary(fun.data = mean_se, geom = "errorbar", position = position_dodge(width = 0.9), width = 0.3, alpha = 0.75) +
facet_wrap(~ SEX) +
labs(y = "Gout PRS") +
theme(axis.title.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Tophaceous Disease vs Urate PRS
# Plotting the relationship between tophaceous disease and the urate PRS.
all_cohorts %>%
filter(GOUT,
!str_detect(COHORT_GOUT, "CRYSTAL")) %>%
mutate(TOPHIGOUT = factor(case_when(TOPHIGOUT ~ "Tophi",
!TOPHIGOUT ~ "No Tophi",
is.na(TOPHIGOUT) ~ "No Data"),
levels = c("Tophi", "No Tophi", "No Data"))) %>%
ggplot(mapping = aes(x = ANCESTRY_GOUT, y = Urate_PRS, fill = TOPHIGOUT)) +
stat_summary(fun = mean, geom = "bar", position = position_dodge()) +
stat_summary(fun.data = mean_se, geom = "errorbar", position = position_dodge(width = 0.9), width = 0.3, alpha = 0.75) +
facet_wrap(~ SEX) +
labs(y = "Urate PRS") +
theme(axis.title.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Plotting relationship between each PRS and age at recruitment
Age at Recruitment vs Gout PRS
# Plotting the relationship between age at recruitment and the gout PRS.
all_cohorts %>%
filter(Pheno.Study != "UK Biobank") %>%
ggplot(mapping = aes(x = AGECOL, y = PRS, color = ANCESTRY_GOUT)) +
geom_point(size = 0.5, alpha = 0.5, shape = 1) +
geom_smooth(method = "lm", se = F) +
facet_wrap(~ SEX) +
labs(x = "Mean Age at Recruitment (years)",
y = "Gout PRS") +
theme(legend.title = element_blank())

Age at Recruitment vs Urate PRS
# Plotting the relationship between age at recruitment and the urate PRS.
all_cohorts %>%
filter(Pheno.Study != "UK Biobank") %>%
ggplot(mapping = aes(x = AGECOL, y = Urate_PRS, color = ANCESTRY_GOUT)) +
geom_point(size = 0.5, alpha = 0.5, shape = 1) +
geom_smooth(method = "lm", se = F) +
facet_wrap(~ SEX) +
labs(x = "Mean Age at Recruitment (years)",
y = "Urate PRS") +
theme(legend.title = element_blank())

Plotting relationship between severity traits
Age at Onset vs Tophaceous Disease
# Plotting relationship between age at onset and tophaceous disease.
all_cohorts %>%
filter(!is.na(AGE1ATK),
!str_detect(COHORT_GOUT, "CRYSTAL")) %>%
mutate(TOPHIGOUT = factor(case_when(is.na(TOPHIGOUT) ~ "No Data",
TOPHIGOUT ~ "Tophi",
!TOPHIGOUT ~ "No Tophi"),
levels = c("Tophi", "No Tophi", "No Data"))) %>%
ggplot(aes(y = AGE1ATK, x = ANCESTRY_GOUT, color = TOPHIGOUT)) +
geom_boxplot() +
facet_wrap(~ SEX) +
labs(y = "Age at Onset (years)") +
theme(axis.title.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Disease Duration vs Tophaceous Disease
# Plotting relationship between disease duration and tophaceous disease.
all_cohorts %>%
filter(!is.na(DURATION),
!str_detect(COHORT_GOUT, "CRYSTAL")) %>%
mutate(TOPHIGOUT = factor(case_when(is.na(TOPHIGOUT) ~ "No Data",
TOPHIGOUT ~ "Tophi",
!TOPHIGOUT ~ "No Tophi"),
levels = c("Tophi", "No Tophi", "No Data"))) %>%
ggplot(aes(y = DURATION, x = ANCESTRY_GOUT, color = TOPHIGOUT)) +
geom_boxplot() +
facet_wrap(~ SEX) +
labs(y = "Disease Duration (years)") +
theme(axis.title.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Flare Frequency (all individuals) vs Tophaceous Disease
# Plotting relationship between flare frequency and tophaceous disease in all individuals.
all_cohorts %>%
filter(!is.na(NUMATK),
!str_detect(COHORT_GOUT, "CRYSTAL"),
!str_detect(COHORT_GOUT, "Ardea")) %>%
mutate(TOPHIGOUT = factor(case_when(is.na(TOPHIGOUT) ~ "No Data",
TOPHIGOUT ~ "Tophi",
!TOPHIGOUT ~ "No Tophi"),
levels = c("Tophi", "No Tophi", "No Data")),
NUMATK = case_when(NUMATK > 52 ~ 52,
TRUE ~ NUMATK)) %>%
ggplot(aes(y = NUMATK, x = ANCESTRY_GOUT, color = TOPHIGOUT)) +
geom_boxplot() +
facet_wrap(~ SEX) +
labs(y = "Number of Flares in Last Year") +
theme(axis.title.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Flare Frequency (> 1 flares only) vs Tophaceous Disease
# Plotting relationship between flare frequency and tophaceous disease in all individuals with > 1 flare in the last year.
all_cohorts %>%
filter(!is.na(NUMATK),
!str_detect(COHORT_GOUT, "CRYSTAL"),
NUMATK >= 2) %>%
mutate(TOPHIGOUT = factor(case_when(is.na(TOPHIGOUT) ~ "No Data",
TOPHIGOUT ~ "Tophi",
!TOPHIGOUT ~ "No Tophi"),
levels = c("Tophi", "No Tophi", "No Data")),
NUMATK = case_when(NUMATK > 52 ~ 52,
TRUE ~ NUMATK)) %>%
ggplot(aes(y = NUMATK, x = ANCESTRY_GOUT, color = TOPHIGOUT)) +
geom_boxplot() +
facet_wrap(~ SEX) +
labs(y = "Number of Flares in Last Year") +
theme(axis.title.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Flare Frequency (all individuals - categorical) vs Tophaceous Disease
# Plotting relationship between flare frequency (categorical) and tophaceous disease in all individuals.
all_cohorts %>%
filter(GOUT,
!str_detect(COHORT_GOUT, "CRYSTAL"),
!str_detect(COHORT_GOUT, "Ardea")) %>%
mutate(FLARE_CAT = factor(case_when(is.na(FLARE_CAT) ~ "No Data",
TRUE ~ as.character(FLARE_CAT)),
levels = rev(c(paste0(0:5),
"6 - 11",
"12 - 52",
"No Data")),
ordered = TRUE),
TOPHIGOUT = factor(case_when(is.na(TOPHIGOUT) ~ "No Data",
TOPHIGOUT ~ "Tophi",
!TOPHIGOUT ~ "No Tophi"),
levels = c("Tophi", "No Tophi", "No Data"))) %>%
group_by(ANCESTRY_GOUT, TOPHIGOUT, FLARE_CAT, SEX) %>%
summarize(value = n()) %>%
ggplot(aes(fill = FLARE_CAT, y = value, x = TOPHIGOUT, color = TOPHIGOUT)) +
geom_bar(position = "fill", stat = "identity") +
facet_wrap(~ SEX * ANCESTRY_GOUT) +
scale_fill_discrete(type = c("#C0C0C0", "#FDE725FF", "#9FDA3AFF", "#4AC16DFF", "#1FA187FF", "#277F8EFF", "#365C8DFF", "#46337EFF", "#440154FF")) +
theme(axis.title.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank())

Flare Frequency (> 1 flares only - categorical) vs Tophaceous Disease
# Plotting relationship between flare frequency (categorical) and tophaceous disease in all individuals with > 1 flares.
all_cohorts %>%
filter(GOUT,
!str_detect(COHORT_GOUT, "CRYSTAL"),
NUMATK >= 2) %>%
mutate(FLARE_CAT = factor(case_when(is.na(FLARE_CAT) ~ "No Data",
TRUE ~ as.character(FLARE_CAT)),
levels = rev(c(paste0(0:5),
"6 - 11",
"12 - 52",
"No Data")),
ordered = TRUE),
TOPHIGOUT = factor(case_when(is.na(TOPHIGOUT) ~ "No Data",
TOPHIGOUT ~ "Tophi",
!TOPHIGOUT ~ "No Tophi"),
levels = c("Tophi", "No Tophi", "No Data"))) %>%
group_by(ANCESTRY_GOUT, TOPHIGOUT, FLARE_CAT, SEX) %>%
summarize(value = n()) %>%
ggplot(aes(fill = FLARE_CAT, y = value, x = TOPHIGOUT, color = TOPHIGOUT)) +
geom_bar(position = "fill", stat = "identity") +
facet_wrap(~ SEX * ANCESTRY_GOUT) +
scale_fill_discrete(type = c("#C0C0C0", "#FDE725FF", "#9FDA3AFF", "#4AC16DFF", "#1FA187FF", "#277F8EFF", "#365C8DFF", "#46337EFF", "#440154FF")) +
theme(axis.title.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank())

Age at Onset vs Flare Frequency (all individuals)
# Plotting relationship between age at gout onset and flare frequency in all individuals.
all_cohorts %>%
filter(!is.na(AGE1ATK),
!is.na(NUMATK),
!str_detect(COHORT_GOUT, "Ardea")) %>%
mutate(NUMATK = case_when(NUMATK > 52 ~ 52,
TRUE ~ NUMATK)) %>%
ggplot(aes(x = AGE1ATK, y = NUMATK, color = ANCESTRY_GOUT)) +
geom_point(size = 0.5, alpha = 0.5, shape = 1) +
geom_smooth(method = "lm", se = F) +
facet_wrap(~ SEX) +
labs(x = "Age at Onset (years)",
y = "Number of Flares in Last Year") +
theme(legend.title = element_blank())

Age at Onset vs Flare Frequency (> 1 flares only)
# Plotting relationship between age at gout onset and flare frequency in all individuals with > 1 flares.
all_cohorts %>%
filter(!is.na(AGE1ATK),
!is.na(NUMATK),
NUMATK >= 2) %>%
mutate(NUMATK = case_when(NUMATK > 52 ~ 52,
TRUE ~ NUMATK)) %>%
ggplot(aes(x = AGE1ATK, y = NUMATK, color = ANCESTRY_GOUT)) +
geom_point(size = 0.5, alpha = 0.5, shape = 1) +
geom_smooth(method = "lm", se = F) +
facet_wrap(~ SEX) +
labs(x = "Age at Onset (years)",
y = "Number of Flares in Last Year") +
theme(legend.title = element_blank())

Duration vs Flare Frequency (all individuals)
# Plotting relationship between disease duration and flare frequency in all individuals.
all_cohorts %>%
filter(!is.na(DURATION),
!is.na(NUMATK),
!str_detect(COHORT_GOUT, "Ardea")) %>%
mutate(NUMATK = case_when(NUMATK > 52 ~ 52,
TRUE ~ NUMATK)) %>%
ggplot(aes(x = DURATION, y = NUMATK, color = ANCESTRY_GOUT)) +
geom_point(size = 0.5, alpha = 0.5, shape = 1) +
geom_smooth(method = "lm", se = F) +
facet_wrap(~ SEX) +
labs(x = "Disease Duration (years)",
y = "Number of Flares in Last Year") +
theme(legend.title = element_blank())

Duration vs Flare Frequency (> 1 flares only)
# Plotting relationship between disease duration and flare frequency in all individuals with > 1 flares.
all_cohorts %>%
filter(!is.na(DURATION),
!is.na(NUMATK),
NUMATK >= 2) %>%
mutate(NUMATK = case_when(NUMATK > 52 ~ 52,
TRUE ~ NUMATK)) %>%
ggplot(aes(x = DURATION, y = NUMATK, color = ANCESTRY_GOUT)) +
geom_point(size = 0.5, alpha = 0.5, shape = 1) +
geom_smooth(method = "lm", se = F) +
facet_wrap(~ SEX) +
labs(x = "Disease Duration (years)",
y = "Number of Flares in Last Year") +
theme(legend.title = element_blank())

Age at Onset vs Flare Frequency (all individuals - categorical)
# Plotting relationship between age at gout onset and flare frequency (categorical) in all individuals.
all_cohorts %>%
filter(GOUT, !is.na(FLARE_CAT),
!str_detect(COHORT_GOUT, "Ardea")) %>%
ggplot(aes(x = FLARE_CAT, y = AGE1ATK, color = COHORT_GOUT)) +
stat_summary(geom = "point", fun = mean) +
stat_summary(geom = "errorbar", fun.data = mean_se, width = 0.3) +
stat_summary(geom = "line", fun = mean, aes(group = COHORT_GOUT)) +
facet_wrap(~ SEX) +
labs(x = "Number of Flares in Last Year (categorical)", y = "Age at Onset (years)") +
theme(legend.title = element_blank())

Age at Onset vs Flare Frequency (> 1 flares only - categorical)
# Plotting relationship between age at gout onset and flare frequency (categorical) in all individuals with > 1 flares.
all_cohorts %>%
filter(GOUT, !is.na(FLARE_CAT),
NUMATK >= 2) %>%
ggplot(aes(x = FLARE_CAT, y = AGE1ATK, color = COHORT_GOUT)) +
stat_summary(geom = "point", fun = mean) +
stat_summary(geom = "errorbar", fun.data = mean_se, width = 0.3) +
stat_summary(geom = "line", fun = mean, aes(group = COHORT_GOUT)) +
facet_wrap(~ SEX) +
labs(x = "Number of Flares in Last Year (categorical)", y = "Age at Onset (years)") +
theme(legend.title = element_blank())

Duration vs Flare Frequency (all individuals - categorical)
# Plotting relationship between disease duration and flare frequency (categorical) in all individuals.
all_cohorts %>%
filter(GOUT, !is.na(FLARE_CAT),
!str_detect(COHORT_GOUT, "Ardea")) %>%
ggplot(aes(x = FLARE_CAT, y = DURATION, color = COHORT_GOUT)) +
stat_summary(geom = "point", fun = mean) +
stat_summary(geom = "errorbar", fun.data = mean_se, width = 0.3) +
stat_summary(geom = "line", fun = mean, aes(group = COHORT_GOUT)) +
facet_wrap(~ SEX) +
labs(x = "Number of Flares in Last Year (categorical)", y = "Disease Duration (years)") +
theme(legend.title = element_blank())

Duration vs Flare Frequency (> 1 flares only - categorical)
# Plotting relationship between disease duration and flare frequency (categorical) in all individuals with > 1 flares.
all_cohorts %>%
filter(GOUT, !is.na(FLARE_CAT),
NUMATK >= 2) %>%
ggplot(aes(x = FLARE_CAT, y = DURATION, color = COHORT_GOUT)) +
stat_summary(geom = "point", fun = mean) +
stat_summary(geom = "errorbar", fun.data = mean_se, width = 0.3) +
stat_summary(geom = "line", fun = mean, aes(group = COHORT_GOUT)) +
facet_wrap(~ SEX) +
labs(x = "Number of Flares in Last Year (categorical)", y = "Disease Duration (years)") +
theme(legend.title = element_blank())

Minor Allele Frequency for Each Gout PRS SNP in Each Cohort
# Making a table with SNPs as rows and columns representing MAF for each cohort for that SNP.
# Extracting the RSID list from the gout PRS table.
SNPlist <- UKBB_Gene_OR$RSID
# Writing function for extracting minor allele frequency for each SNP.
get_maf <- function(cohort, snps){
# Extracting all SNP columns and converting them to factors plus removing any individuals that are missing any SNPs.
test <- cohort %>%
select(all_of(snps)) %>%
mutate_all(factor, levels = 0:2) %>%
na.omit()
# Making tmp variable.
tmp <- c()
# For each column in test (each SNP), extract the
for(i in 1:ncol(test)){
tmp[i] <- ((sum(test[[i]] == 1) + (sum(test[[i]] == 2) * 2)) / (2 * nrow(test))) %>% sprintf(fmt = "%#.3f")
}
return(tmp)
}
# Making table of minor allele frequencies.
freq_table <- tibble("Cohort" = rep(cohortstring, each = length(SNPlist)),
"MAF" = unlist(lapply(data_list2, function(x) get_maf(cohort = x, snps = SNPlist))),
"SNP" = rep(SNPlist, length(cohortstring))) %>%
pivot_wider(names_from = SNP, values_from = MAF)
# Cleaning up table for viewing.
freq_table <- transpose_df(freq_table) %>%
column_to_rownames(var = "Cohort") %>%
mutate(across(.cols = 1:ncol(freq_table), ~ str_replace(string = .x, pattern = " ", replacement = " ")))
# Further cleaning up table for viewing.
row.names(freq_table) <- str_replace(row.names(freq_table), " ", " ")
# Printing table.
freq_table %>%
kable(col.names = clean_names,
align = "c",
escape = F) %>%
kable_styling("striped") %>%
scroll_box(width = "900px", height = "475px")
|
|
UK Biobank Gout Male
|
UK Biobank Gout Female
|
UK Biobank Control Male
|
UK Biobank Control Female
|
Aus/NZ European Gout Male
|
Aus/NZ European Gout Female
|
Aus/NZ European Control Male
|
Aus/NZ European Control Female
|
GlobalGout Gout Male
|
GlobalGout Gout Female
|
GlobalGout Control Male
|
GlobalGout Control Female
|
Ardea LASSO Gout Male
|
Ardea LASSO Gout Female
|
Ardea CLEAR1 Gout Male
|
Ardea CLEAR1 Gout Female
|
Ardea CLEAR2 Gout Male
|
Ardea CLEAR2 Gout Female
|
Ardea CRYSTAL Gout Male
|
Ardea CRYSTAL Gout Female
|
Ardea LIGHT Gout Male
|
Ardea LIGHT Gout Female
|
East Polynesian Gout Male
|
East Polynesian Gout Female
|
East Polynesian Control Male
|
East Polynesian Control Female
|
East Polynesian Gout Male NP
|
East Polynesian Gout Female NP
|
East Polynesian Control Male NP
|
East Polynesian Control Female NP
|
West Polynesian Gout Male
|
West Polynesian Gout Female
|
West Polynesian Control Male
|
West Polynesian Control Female
|
|
rs10910845
|
0.501
|
0.454
|
0.461
|
0.461
|
0.491
|
0.460
|
0.473
|
0.477
|
0.496
|
0.496
|
0.591
|
0.481
|
0.537
|
0.508
|
0.504
|
0.344
|
0.531
|
0.357
|
0.520
|
0.375
|
0.560
|
0.625
|
0.686
|
0.721
|
0.676
|
0.632
|
0.673
|
0.750
|
0.705
|
0.676
|
0.704
|
0.657
|
0.653
|
0.636
|
|
rs11264341
|
0.598
|
0.606
|
0.572
|
0.569
|
0.602
|
0.560
|
0.554
|
0.542
|
0.609
|
0.569
|
0.568
|
0.608
|
0.587
|
0.631
|
0.585
|
0.656
|
0.609
|
0.714
|
0.649
|
0.625
|
0.624
|
0.562
|
0.450
|
0.443
|
0.432
|
0.461
|
0.440
|
0.536
|
0.455
|
0.419
|
0.396
|
0.537
|
0.402
|
0.353
|
|
rs1260326
|
0.442
|
0.456
|
0.392
|
0.392
|
0.449
|
0.486
|
0.396
|
0.389
|
0.465
|
0.387
|
0.545
|
0.513
|
0.449
|
0.485
|
0.476
|
0.531
|
0.506
|
0.643
|
0.471
|
0.375
|
0.477
|
0.438
|
0.346
|
0.361
|
0.303
|
0.277
|
0.278
|
0.339
|
0.330
|
0.378
|
0.354
|
0.315
|
0.249
|
0.297
|
|
rs9847710
|
0.452
|
0.456
|
0.425
|
0.424
|
0.453
|
0.440
|
0.429
|
0.414
|
0.448
|
0.484
|
0.386
|
0.386
|
0.457
|
0.438
|
0.493
|
0.438
|
0.406
|
0.500
|
0.434
|
0.375
|
0.445
|
0.688
|
0.462
|
0.496
|
0.432
|
0.473
|
0.431
|
0.464
|
0.443
|
0.446
|
0.654
|
0.602
|
0.643
|
0.639
|
|
rs7675964
|
0.813
|
0.841
|
0.723
|
0.727
|
0.805
|
0.838
|
0.728
|
0.721
|
0.800
|
0.798
|
0.693
|
0.753
|
0.805
|
0.808
|
0.839
|
0.781
|
0.810
|
0.929
|
0.829
|
1.000
|
0.780
|
0.938
|
0.779
|
0.795
|
0.778
|
0.741
|
0.770
|
0.839
|
0.716
|
0.770
|
0.578
|
0.667
|
0.530
|
0.519
|
|
rs4481233
|
0.882
|
0.917
|
0.808
|
0.811
|
0.882
|
0.895
|
0.815
|
0.805
|
0.869
|
0.887
|
0.830
|
0.861
|
0.882
|
0.908
|
0.911
|
0.875
|
0.893
|
0.929
|
0.903
|
1.000
|
0.872
|
0.938
|
0.971
|
0.975
|
0.961
|
0.958
|
0.956
|
0.982
|
0.943
|
0.905
|
0.976
|
0.972
|
0.975
|
0.947
|
|
rs6811287
|
0.621
|
0.634
|
0.548
|
0.550
|
0.613
|
0.650
|
0.543
|
0.557
|
0.605
|
0.649
|
0.568
|
0.608
|
0.643
|
0.585
|
0.641
|
0.656
|
0.613
|
0.643
|
0.637
|
0.625
|
0.601
|
0.625
|
0.607
|
0.652
|
0.564
|
0.598
|
0.573
|
0.679
|
0.625
|
0.446
|
0.689
|
0.657
|
0.749
|
0.681
|
|
rs2231142
|
0.207
|
0.162
|
0.110
|
0.114
|
0.216
|
0.188
|
0.110
|
0.117
|
0.191
|
0.165
|
0.114
|
0.089
|
0.228
|
0.169
|
0.246
|
0.281
|
0.226
|
0.214
|
0.283
|
0.125
|
0.193
|
0.000
|
0.115
|
0.074
|
0.066
|
0.072
|
0.089
|
0.071
|
0.057
|
0.041
|
0.462
|
0.426
|
0.236
|
0.231
|
|
rs10011796
|
0.540
|
0.506
|
0.458
|
0.460
|
0.538
|
0.512
|
0.457
|
0.468
|
0.529
|
0.516
|
0.534
|
0.456
|
0.555
|
0.608
|
0.537
|
0.625
|
0.596
|
0.714
|
0.560
|
0.500
|
0.583
|
0.438
|
0.599
|
0.635
|
0.552
|
0.555
|
0.548
|
0.536
|
0.534
|
0.541
|
0.678
|
0.630
|
0.613
|
0.594
|
|
rs1165196
|
0.610
|
0.602
|
0.566
|
0.569
|
0.605
|
0.638
|
0.580
|
0.579
|
0.628
|
0.593
|
0.557
|
0.475
|
0.649
|
0.600
|
0.602
|
0.656
|
0.617
|
0.714
|
0.606
|
0.750
|
0.610
|
0.688
|
0.734
|
0.713
|
0.695
|
0.707
|
0.669
|
0.714
|
0.727
|
0.595
|
0.763
|
0.759
|
0.704
|
0.700
|
|
rs853685
|
0.182
|
0.180
|
0.165
|
0.165
|
0.170
|
0.145
|
0.154
|
0.162
|
0.159
|
0.185
|
0.136
|
0.101
|
0.145
|
0.185
|
0.139
|
0.188
|
0.178
|
0.143
|
0.166
|
0.125
|
0.142
|
0.125
|
0.091
|
0.123
|
0.108
|
0.089
|
0.097
|
0.161
|
0.125
|
0.135
|
0.173
|
0.130
|
0.123
|
0.117
|
|
rs3812316
|
0.894
|
0.902
|
0.870
|
0.871
|
0.896
|
0.881
|
0.875
|
0.867
|
0.890
|
0.879
|
0.920
|
0.880
|
0.890
|
0.954
|
0.911
|
0.875
|
0.895
|
0.929
|
0.914
|
1.000
|
0.913
|
1.000
|
0.974
|
0.984
|
0.967
|
0.958
|
0.952
|
0.946
|
0.989
|
0.973
|
0.990
|
0.991
|
0.982
|
0.978
|
|
rs1171616
|
0.801
|
0.782
|
0.768
|
0.768
|
0.791
|
0.788
|
0.776
|
0.779
|
0.795
|
0.786
|
0.761
|
0.804
|
0.808
|
0.792
|
0.809
|
0.844
|
0.833
|
0.714
|
0.800
|
0.625
|
0.844
|
0.812
|
0.949
|
0.959
|
0.936
|
0.949
|
0.935
|
0.946
|
0.943
|
0.878
|
0.986
|
0.963
|
0.977
|
0.975
|
|
rs17300741
|
0.501
|
0.477
|
0.448
|
0.451
|
0.491
|
0.405
|
0.489
|
0.465
|
0.528
|
0.560
|
0.466
|
0.582
|
0.507
|
0.585
|
0.509
|
0.469
|
0.533
|
0.500
|
0.520
|
0.500
|
0.518
|
0.750
|
0.797
|
0.807
|
0.793
|
0.771
|
0.766
|
0.768
|
0.807
|
0.730
|
0.818
|
0.778
|
0.859
|
0.861
|
|
rs7937990
|
0.214
|
0.189
|
0.184
|
0.187
|
0.201
|
0.195
|
0.181
|
0.205
|
0.236
|
0.222
|
0.205
|
0.158
|
0.225
|
0.192
|
0.222
|
0.375
|
0.285
|
0.143
|
0.243
|
0.000
|
0.248
|
0.250
|
0.456
|
0.447
|
0.461
|
0.408
|
0.403
|
0.500
|
0.511
|
0.378
|
0.452
|
0.454
|
0.369
|
0.394
|
|
rs4014195
|
0.368
|
0.356
|
0.341
|
0.342
|
0.383
|
0.350
|
0.338
|
0.347
|
0.382
|
0.367
|
0.398
|
0.386
|
0.370
|
0.354
|
0.367
|
0.406
|
0.412
|
0.571
|
0.369
|
0.250
|
0.376
|
0.500
|
0.420
|
0.352
|
0.402
|
0.398
|
0.351
|
0.375
|
0.386
|
0.378
|
0.252
|
0.269
|
0.201
|
0.256
|
|
rs1106766
|
0.790
|
0.776
|
0.756
|
0.756
|
0.799
|
0.798
|
0.764
|
0.767
|
0.786
|
0.722
|
0.852
|
0.797
|
0.805
|
0.823
|
0.850
|
0.844
|
0.816
|
0.857
|
0.803
|
0.875
|
0.821
|
0.812
|
0.928
|
0.947
|
0.934
|
0.933
|
0.919
|
0.982
|
0.966
|
0.946
|
0.978
|
0.981
|
0.962
|
0.975
|
|
rs28652632
|
0.548
|
0.549
|
0.522
|
0.523
|
0.517
|
0.545
|
0.531
|
0.537
|
0.528
|
0.512
|
0.523
|
0.500
|
0.536
|
0.515
|
0.498
|
0.500
|
0.538
|
0.214
|
0.523
|
0.625
|
0.578
|
0.562
|
0.681
|
0.668
|
0.660
|
0.666
|
0.738
|
0.696
|
0.625
|
0.797
|
0.592
|
0.574
|
0.550
|
0.567
|
|
rs738409
|
0.805
|
0.796
|
0.783
|
0.784
|
0.801
|
0.800
|
0.777
|
0.796
|
0.780
|
0.802
|
0.761
|
0.772
|
0.780
|
0.785
|
0.750
|
0.750
|
0.818
|
0.857
|
0.780
|
0.750
|
0.775
|
0.875
|
0.748
|
0.816
|
0.759
|
0.791
|
0.802
|
0.839
|
0.784
|
0.797
|
0.739
|
0.806
|
0.754
|
0.758
|
Running Models
The purpose of this section is to run all models of interest for assessing the relationship between gout/urate genetic risk and severity of gout. This includes modeling gout vs each PRS, and modeling age at gout onset and tophaceous disease vs the PRS. I have decided to not run any flare frequency models given that the phenotype is not suitable for analysis (see the assumptions section below).
# Loading phenotype file.
load(path(scratch_path, "Output/all_pheno_prs.RData"))
# Loading gout and urate PRS files.
load(here("Output/UKBB_Gene_OR.RData"))
load(here("Output/Tin_Gene_OR.RData"))
# Making a categorical flare frequency variable (FLARE_CAT) and setting all control gout severity traits to NA, then preparing the data further for models.
all_pheno_prs <- all_pheno_prs %>%
mutate(FLARE_CAT = factor(case_when(between(NUMATK, 0, 5) ~ paste0(as.character(NUMATK), " flares in last year"),
between(NUMATK, 6, 11) ~ "One every one to two months",
between(NUMATK, 12, 52) ~ "One or more per month"),
levels = c(paste0(0:5, " flares in last year"),
"One every one to two months",
"One or more per month"),
labels = c(paste0(0:5),
"6 - 11",
"12 - 52"),
ordered = TRUE),
AGE1ATK = case_when(GOUT ~ AGE1ATK),
DURATION = case_when(GOUT ~ DURATION),
NUMATK = case_when(GOUT ~ NUMATK),
TOPHIGOUT = case_when(GOUT ~ TOPHIGOUT),
ULT = case_when(GOUT ~ ULT),
SEX = factor(SEX, levels = c("Male", "Female")),
GOUT2 = factor(GOUT, levels = c(TRUE, FALSE), labels = c("Gout", "Control")),
ANCESTRY_GOUT = factor(case_when(GOUT & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "European Gout",
!GOUT & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "European Control",
GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study != "Ngati Porou" ~ "East Polynesian Gout",
!GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study != "Ngati Porou" ~ "East Polynesian Control",
GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study == "Ngati Porou" ~ "East Polynesian Gout - NP",
!GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study == "Ngati Porou" ~ "East Polynesian Control - NP",
GOUT & Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan") ~ "West Polynesian Gout",
!GOUT & Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan") ~ "West Polynesian Control"),
levels = c("European Gout",
"European Control",
"East Polynesian Gout",
"East Polynesian Control",
"East Polynesian Gout - NP",
"East Polynesian Control - NP",
"West Polynesian Gout",
"West Polynesian Control")),
ANCESTRY_GOUT_SEX = factor(case_when(ANCESTRY_GOUT == "European Gout" & SEX == "Male" ~ "European Gout - male",
ANCESTRY_GOUT == "European Gout" & SEX == "Female" ~ "European Gout - female",
ANCESTRY_GOUT == "European Control" & SEX == "Male" ~ "European Control - male",
ANCESTRY_GOUT == "European Control" & SEX == "Female" ~ "European Control - female",
ANCESTRY_GOUT == "East Polynesian Gout" & SEX == "Male" ~ "East Polynesian Gout - male",
ANCESTRY_GOUT == "East Polynesian Gout" & SEX == "Female" ~ "East Polynesian Gout - female",
ANCESTRY_GOUT == "East Polynesian Control" & SEX == "Male" ~ "East Polynesian Control - male",
ANCESTRY_GOUT == "East Polynesian Control" & SEX == "Female" ~ "East Polynesian Control - female",
ANCESTRY_GOUT == "East Polynesian Gout - NP" & SEX == "Male" ~ "East Polynesian Gout - NP - male",
ANCESTRY_GOUT == "East Polynesian Gout - NP" & SEX == "Female" ~ "East Polynesian Gout - NP - female",
ANCESTRY_GOUT == "East Polynesian Control - NP" & SEX == "Male" ~ "East Polynesian Control - NP - male",
ANCESTRY_GOUT == "East Polynesian Control - NP" & SEX == "Female" ~ "East Polynesian Control - NP - female",
ANCESTRY_GOUT == "West Polynesian Gout" & SEX == "Male" ~ "West Polynesian Gout - male",
ANCESTRY_GOUT == "West Polynesian Gout" & SEX == "Female" ~ "West Polynesian Gout - female",
ANCESTRY_GOUT == "West Polynesian Control" & SEX == "Male" ~ "West Polynesian Control - male",
ANCESTRY_GOUT == "West Polynesian Control" & SEX == "Female" ~ "West Polynesian Control - female"),
levels = c("European Gout - male",
"European Gout - female",
"European Control - male",
"European Control - female",
"East Polynesian Gout - male",
"East Polynesian Gout - female",
"East Polynesian Control - male",
"East Polynesian Control - female",
"East Polynesian Gout - NP - male",
"East Polynesian Gout - NP - female",
"East Polynesian Control - NP - male",
"East Polynesian Control - NP - female",
"West Polynesian Gout - male",
"West Polynesian Gout - female",
"West Polynesian Control - male",
"West Polynesian Control - female")),
SEX_GOUT = factor(case_when(GOUT & SEX == "Male" ~ "Male Gout",
GOUT & SEX == "Female" ~ "Female Gout",
!GOUT & SEX == "Male" ~ "Male Control",
!GOUT & SEX == "Female" ~ "Female Control"),
levels = c("Male Gout",
"Female Gout",
"Male Control",
"Female Control")),
COHORT_GOUT = factor(case_when(GOUT & Pheno.Study == "UK Biobank" ~ "UK Biobank - Gout",
!GOUT & Pheno.Study == "UK Biobank" ~ "UK Biobank - Control",
GOUT & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") & Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease") ~ "Aus/NZ - Gout",
!GOUT & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") & Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease") ~ "Aus/NZ - Control",
GOUT & Pheno.Study == "EuroGout" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "GlobalGout - Gout",
!GOUT & Pheno.Study == "EuroGout" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "GlobalGout - Control",
Pheno.Study == "Ardea: 401" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "Ardea - LASSO",
Pheno.Study == "Ardea: CLEAR1" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "Ardea - CLEAR1",
Pheno.Study == "Ardea: CLEAR2" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "Ardea - CLEAR2",
Pheno.Study == "Ardea: CRYSTAL" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "Ardea - CRYSTAL",
Pheno.Study == "Ardea: LIGHT" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "Ardea - LIGHT",
GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study != "Ngati Porou" ~ "East Polynesian - Gout",
!GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study != "Ngati Porou" ~ "East Polynesian - Control",
GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study == "Ngati Porou" ~ "East Polynesian - Gout - NP",
!GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study == "Ngati Porou" ~ "East Polynesian - Control - NP",
GOUT & Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan") ~ "West Polynesian - Gout",
!GOUT & Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan") ~ "West Polynesian - Control"),
levels = c("UK Biobank - Gout",
"UK Biobank - Control",
"Aus/NZ - Gout",
"Aus/NZ - Control",
"GlobalGout - Gout",
"GlobalGout - Control",
"Ardea - LASSO",
"Ardea - CLEAR1",
"Ardea - CLEAR2",
"Ardea - CRYSTAL",
"Ardea - LIGHT",
"East Polynesian - Gout",
"East Polynesian - Control",
"East Polynesian - Gout - NP",
"East Polynesian - Control - NP",
"West Polynesian - Gout",
"West Polynesian - Control"))) %>%
filter(!is.na(COHORT_GOUT),
!is.na(AGECOL),
!is.na(PRS),
(GOUT & !(is.na(AGE1ATK) & is.na(NUMATK) & is.na(TOPHIGOUT)) | Pheno.Study == "UK Biobank" | !GOUT))
# Making list with each of the cohort subsets. This will make it easier to manipulate the data for models.
data_list <- list("UK Biobank - Gout - Male" = all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Pheno.Study == "UK Biobank"),
"UK Biobank - Gout - Female" = all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Pheno.Study == "UK Biobank"),
"UK Biobank - Control - Male" = all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Pheno.Study == "UK Biobank"),
"UK Biobank - Control - Female" = all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Pheno.Study == "UK Biobank"),
"Aus/NZ European - Gout - Male" = all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian"),
Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease")),
"Aus/NZ European - Gout - Female" = all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian"),
Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease")),
"Aus/NZ European - Control - Male" = all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian"),
Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease")),
"Aus/NZ European - Control - Female" = all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian"),
Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease")),
"GlobalGout - Gout - Male" = all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Pheno.Study == "EuroGout",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"GlobalGout - Gout - Female" = all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Pheno.Study == "EuroGout",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"GlobalGout - Control - Male" = all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Pheno.Study == "EuroGout",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"GlobalGout - Control - Female" = all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Pheno.Study == "EuroGout",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - LASSO - Male" = all_pheno_prs %>% filter(SEX == "Male",
Pheno.Study == "Ardea: 401",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - LASSO - Female" = all_pheno_prs %>% filter(SEX == "Female",
Pheno.Study == "Ardea: 401",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - CLEAR1 - Male" = all_pheno_prs %>% filter(SEX == "Male",
Pheno.Study == "Ardea: CLEAR1",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - CLEAR1 - Female" = all_pheno_prs %>% filter(SEX == "Female",
Pheno.Study == "Ardea: CLEAR1",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - CLEAR2 - Male" = all_pheno_prs %>% filter(SEX == "Male",
Pheno.Study == "Ardea: CLEAR2",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - CLEAR2 - Female" = all_pheno_prs %>% filter(SEX == "Female",
Pheno.Study == "Ardea: CLEAR2",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - CRYSTAL - Male" = all_pheno_prs %>% filter(SEX == "Male",
Pheno.Study == "Ardea: CRYSTAL",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - CRYSTAL - Female" = all_pheno_prs %>% filter(SEX == "Female",
Pheno.Study == "Ardea: CRYSTAL",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - LIGHT - Male" = all_pheno_prs %>% filter(SEX == "Male",
Pheno.Study == "Ardea: LIGHT",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - LIGHT - Female" = all_pheno_prs %>% filter(SEX == "Female",
Pheno.Study == "Ardea: LIGHT",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"East Polynesian - Gout - Male" = all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study != "Ngati Porou"),
"East Polynesian - Gout - Female" = all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study != "Ngati Porou"),
"East Polynesian - Control - Male" = all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study != "Ngati Porou"),
"East Polynesian - Control - Female" = all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study != "Ngati Porou"),
"East Polynesian - Gout - NP - Male" = all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study == "Ngati Porou"),
"East Polynesian - Gout - NP - Female" = all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study == "Ngati Porou"),
"East Polynesian - Control - NP - Male" = all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study == "Ngati Porou"),
"East Polynesian - Control - NP - Female" = all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study == "Ngati Porou"),
"West Polynesian - Gout - Male" = all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan")),
"West Polynesian - Gout - Female" = all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan")),
"West Polynesian - Control - Male" = all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan")),
"West Polynesian - Control - Female" = all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan")))
# Making function for extracting the odds ratio for a named predictor from a logistic model object, then formatting it for printing.
OR <- function(x, Predictor) {
sprintf(exp(coef(x))[[Predictor]], fmt = "%#.4f")
}
# Making function for extracting the lower confidence limit of the odds ratio for a named predictor from a logistic model object, then formatting it for printing.
LCL_OR <- function(x, Predictor) {
sprintf(exp(confint.default(x))[Predictor, 1], fmt = "%#.4f")
}
# Making function for extracting the upper confidence limit of the odds ratio for a named predictor from a logistic model object, then formatting it for printing.
UCL_OR <- function(x, Predictor) {
sprintf(exp(confint.default(x))[Predictor, 2], fmt = "%#.4f")
}
# Making function for extracting the p-value for a named predictor from any model object, then formatting it for printing.
Pval <- function(x, Predictor) {
signif(summary(x)$coefficients[Predictor, 4], 3)
}
# Making function for extracting the beta for a named predictor from a linear model object, then formatting it for printing.
Beta <- function(x, Predictor) {
sprintf(coef(x)[[Predictor]], fmt = "%#.4f")
}
# Making function for extracting the lower confidence limit of the beta for a named predictor from a linear model object, then formatting it for printing.
LCL <- function(x, Predictor) {
sprintf(confint.default(x)[Predictor, 1], fmt = "%#.4f")
}
# Making function for extracting the upper confidence limit of the beta for a named predictor from a linear model object, then formatting it for printing.
UCL <- function(x, Predictor) {
sprintf(confint.default(x)[Predictor, 2], fmt = "%#.4f")
}
Assumptions
Prior to running any models, I will test that we meet the assumptions of linear and logistic regression. For linear regression, the assumptions are:
- Linear outcome variable
- Independent observations
- No extreme outliers (Cook’s distance)
- Linear relationship between variables
- Normality of residuals
- Homoscedasticity of residuals
Logistic regression models assume:
- Binary outcome variable
- Independent observations
- No extreme outliers (Cook’s distance)
- Linear relationship between predictor and logit of the outcome (Box-Tidwell test)
All linear and logistic multiple regressions also relies on no multicollinearity (this is the only assumption for logistic regression).
All assumptions should be tested/confirmed for all models, but for now I will just test some representative models instead. This will be for each of the main outcomes in a pooled cohort of male European gout cases or cases/controls. The outcome variable type and independence assumptions can be assumed to be met. I will need to check for evidence of extreme outliers in both model types, and to test linearity between the predictor and the outcome (or log-odds of the outcome). For linear models, I will also test the normality and homoscedasticity of residuals assumptions. Finally I will need to test multicollinearity for both types of model.
# Producing a temporary cohort for running the models.
tmp <- all_pheno_prs %>%
filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("European", "Iberian", "European; Iberian"),
!is.na(PRS))
# Testing all assumptions for representative age at onset model.
# Testing age at onset model linearity assumption by plotting PRS vs age at onset, showing apparent linear relationship.
ggplot(tmp, aes(x = PRS, y = AGE1ATK, color = COHORT_GOUT)) +
geom_point(shape = 1) +
geom_smooth(se = FALSE)
# Producing model object.
mod <- lm(AGE1ATK ~ PRS + Geno.PCVector1 + Geno.PCVector2 + Geno.PCVector3 + Geno.PCVector4 + Geno.PCVector5 + Geno.PCVector6 + Geno.PCVector7 + Geno.PCVector8 + Geno.PCVector9 + Geno.PCVector10, data = tmp)
# Adding parameters to model object for testing other assumptions.
test <- augment(mod)
# Plotting fitted values vs residual values showing no obvious pattern, so homoscedasticity seems to be met.
ggplot(test, aes(x = .fitted, y = .resid)) +
geom_point()
# Plotting distribution of residuals in histogram, showing normality of residuals.
ggplot(data = test, mapping = aes(x = .resid)) +
geom_histogram(mapping = aes(y = ..density..), bins = 30, fill = 'gray', color = 'black') +
stat_function(fun = 'dnorm',
args = list(mean = mean(test$.resid), sd = sd(test$.resid)),
color = 'red')
# Plotting distribution of residuals in quantile-quantile plot, showing normality of residuals.
ggplot(data = test, mapping = aes(sample = .resid)) +
geom_qq() +
geom_qq_line()
# Plotting distribution of residuals in boxplot, showing normality of residuals.
ggplot(data = test, mapping = aes(x = .resid)) +
geom_boxplot() +
theme(axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank())
# Producing list of normality test statistics for residuals, showing skewness and kurtosis values are within normal range, ignoring the 2SE and Shapiro-Wilk test values given the large dataset.
test %>%
pull(.resid) %>%
stat.desc(basic = FALSE, desc = FALSE, norm = TRUE) %>%
enframe() %>%
pivot_wider(names_from = name, values_from = value)
# Testing multicollinearity using variance inflation factor, none of which are over 5 or 10, indicating a lack of issues with this assumption.
vif(mod)
# Testing assumptions of representative tophaceous gout model.
# Plotting relationship between variables to be tested.
ggplot(tmp, aes(x = TOPHIGOUT, y = PRS)) +
stat_summary(fun = mean, geom = "point") +
stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.3, alpha = 0.75) +
labs(y = "Gout PRS") +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())
# Running model for PRS vs tophaceous gout.
mod <- glm(TOPHIGOUT ~ PRS + Geno.PCVector1 + Geno.PCVector2 + Geno.PCVector3 + Geno.PCVector4 + Geno.PCVector5 + Geno.PCVector6 + Geno.PCVector7 + Geno.PCVector8 + Geno.PCVector9 + Geno.PCVector10, family = "binomial", data = tmp)
# Adding parameters to model object for testing other assumptions.
test <- augment(mod)
# Plotting fitted values as a function of PRS, shows linearity of relationship.
ggplot(test, aes(x = PRS, y = .fitted)) +
geom_point()
# Plotting the Cook's distance will identifies outliers, which there appear to be based on the 4/N cutoff.
ggplot(test, aes(x = PRS, y = .cooksd, alpha = abs(.std.resid))) +
geom_point() +
geom_hline(yintercept = 4 / NROW(test), color = "red")
# Doing the same for standard residuals, which should be under 3. Given all values are under 3, I will assume this assumption is met.
ggplot(test, aes(x = PRS, y = abs(.std.resid), alpha = .cooksd)) +
geom_point() +
geom_hline(yintercept = 3, color = "red")
# Testing all assumptions for representative flare frequency model.
# First modifying the flare variable to exclude outliers and removing individuals with fewer than 2 flares reported in the last year.
tmp <- tmp %>%
mutate(NUMATK = case_when(NUMATK > 52 ~ 52,
TRUE ~ NUMATK)) %>%
filter(NUMATK >= 2,
!is.na(NUMATK))
# Testing flare frequency model linearity assumption by plotting PRS vs flare frequency, showing apparent lack of linear relationship.
ggplot(tmp, aes(x = PRS, y = NUMATK, color = COHORT_GOUT)) +
geom_point(shape = 1) +
geom_smooth(se = FALSE)
# Producing model object.
mod <- lm(NUMATK ~ PRS + Geno.PCVector1 + Geno.PCVector2 + Geno.PCVector3 + Geno.PCVector4 + Geno.PCVector5 + Geno.PCVector6 + Geno.PCVector7 + Geno.PCVector8 + Geno.PCVector9 + Geno.PCVector10, data = tmp)
# Add parameters to model object for testing other assumptions.
test <- augment(mod)
# Plotting fitted values vs residual values showing obvious pattern, so homoscedasticity seems to not be met.
ggplot(test, aes(x = .fitted, y = .resid)) +
geom_point()
# Plotting distribution of residuals in histogram, showing lack of normality of residuals.
ggplot(data = test, mapping = aes(x = .resid)) +
geom_histogram(mapping = aes(y = ..density..), bins = 30, fill = 'gray', color = 'black') +
stat_function(fun = 'dnorm',
args = list(mean = mean(test$.resid), sd = sd(test$.resid)),
color = 'red')
# Plotting distribution of residuals in quantile-quantile plot, showing lack of normality of residuals.
ggplot(data = test, mapping = aes(sample = .resid)) +
geom_qq() +
geom_qq_line()
# Plotting distribution of residuals in boxplot, showing lack of normality of residuals.
ggplot(data = test, mapping = aes(x = .resid)) +
geom_boxplot() +
theme(axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank())
# Producing list of normality test statistics for residuals, showing skewness and kurtosis values are outside of normal range, ignoring the 2SE and Shapiro-Wilk test values given the large dataset.
test %>%
pull(.resid) %>%
stat.desc(basic = FALSE, desc = FALSE, norm = TRUE) %>%
enframe() %>%
pivot_wider(names_from = name, values_from = value)
# Testing multicollinearity using variance inflation factor, none of which are over 5 or 10, indicating a lack of issues with this assumption.
vif(mod)
Based on the above, it should be appropriate to run linear regression models on age at onset, but not flare frequency. Given that the flare frequency variable cannot be analyzed as a linear variable, I then tested categorized flare frequency as an outcome. I tested whether it would be possible to run both an ordinal logistic regression and an ANOVA model.
The ordinal logistic regression test should tell us the odds of being in any combination of higher flare categories vs the remainder of the categories (i.e. highest flare group vs all others, or highest 5 flare groups vs lowest flare group). The interpretation of this model would be that for those individuals with 1 unit more PRS than another group, the odds of being in a higher flare category (i.e. >= 3 flares vs < 3 flares) were X times those of the other group (i.e. those with 1 unit PRS lower - the reference group).
The ANOVA should tell us if there is a difference in mean PRS between groups of flare categories. I am not sure how easy it will be to adjust for covariates in an ANOVA (perhaps I could run an ANCOVA but it might not be doing what I think it’s doing). I should further note that I am unclear on how to meta-analyze ANOVA results - perhaps just running them within each cohort and determining how many show significant differences and, of those, which groups were different and in which direction? The other option is to pool cohorts but that runs into potentially weird issues of bias. So I think in the end I should run both models and see if I get a different answer, then decide how to present them in the paper.
# Testing all assumptions for representative flare frequency model, using a categorical transformation of the flare variable.
# Running the model.
mod <- polr(FLARE_CAT ~ PRS, data = tmp, Hess = TRUE)
# Extracting the model results from the model object, suggesting that the model is significant for the full male gout cohort, with an OR of 1.13 [1.02, 1.26], p = 0.0165 per PRS unit.
modsum <- tibble("Group of interest" = row.names(summary(mod)$coefficients)[1],
"OR" = exp(summary(mod)$coefficients)[1],
"Lower CI" = exp(confint.default(mod))[1],
"Upper CI" = exp(confint.default(mod))[2],
"P-value" = pnorm(abs(summary(mod)$coefficients[1, "t value"]), lower.tail = FALSE) * 2)
# Visualizing this by looking at proportions of different flare categories within bins of PRS.
tmp %>%
filter(!is.na(PRS)) %>%
mutate(PRS_bin = factor(ntile(PRS, 6), ordered = T),
FLARE_CAT = factor(case_when(is.na(FLARE_CAT) ~ "No Data",
TRUE ~ as.character(FLARE_CAT)),
levels = rev(c(paste0(0:5),
"6 - 11",
"12 - 52",
"No Data")),
ordered = TRUE)) %>%
group_by(PRS_bin, FLARE_CAT, SEX, ANCESTRY_GOUT) %>%
summarize(value = n()) %>%
ggplot(aes(fill = FLARE_CAT, y = value, x = PRS_bin)) +
geom_bar(position = "fill", stat = "identity") +
facet_wrap(~ SEX * ANCESTRY_GOUT) +
scale_fill_discrete(type = c("#C0C0C0", "#FDE725FF", "#9FDA3AFF", "#4AC16DFF", "#1FA187FF", "#277F8EFF", "#365C8DFF", "#46337EFF", "#440154FF")) +
theme(axis.title.x = element_blank(),
axis.ticks.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank())
# Running the ANOVA test, essentially asking "does the mean PRS differ between flare categories".
# The assumptions of a one-way independent ANOVA are:
# 1. Independence of the datapoints (met)
# 2. Normality of model residuals within groups
# 3. Homogeneity of variance of model residuals within groups
# Creating model object.
mod <- tmp %>%
mutate(IID = factor(IID),
FLARE_CAT = factor(FLARE_CAT)) %>%
ezANOVA(data = .,
dv = PRS,
between = FLARE_CAT,
wid = IID,
type = 3,
return_aov = TRUE)
# Printing results of model shows that there is a non-significant main effect of the PRS.
mod
# Visualizing the relationship.
tmp %>%
ggplot(aes(x = FLARE_CAT, y = PRS)) +
geom_boxplot()
# Cleaning up.
rm(tmp, mod, modsum, test)
Given that both models gave different results, it seems I should be cautious about using either approach in the manuscript. Perhaps I can return to this in a later study.
Gout Models
Below, we model the PRS, the PRS without ABCG2, and both ABCG2 SNPs simultaneously against gout, adjusting for global PCs. This is done in males and females separately, and is both adjusted for age at collection and unadjusted. It is also run with the 10 oceanian PCs for Polynesian cohorts. The output of all of these models is stored in a list object which we then extract all of the important elements from in a table format.
# Producing list of datasets for modeling gout vs each predictor.
gout_data_list <- list("UK Biobank - Male" = full_join(data_list[["UK Biobank - Gout - Male"]],
data_list[["UK Biobank - Control - Male"]]),
"UK Biobank - Female" = full_join(data_list[["UK Biobank - Gout - Female"]],
data_list[["UK Biobank - Control - Female"]]),
"Aus/NZ European - Male" = full_join(data_list[["Aus/NZ European - Gout - Male"]],
data_list[["Aus/NZ European - Control - Male"]]),
"Aus/NZ European - Female" = full_join(data_list[["Aus/NZ European - Gout - Female"]],
data_list[["Aus/NZ European - Control - Female"]]),
"East Polynesian - Male" = full_join(data_list[["East Polynesian - Gout - Male"]],
data_list[["East Polynesian - Control - Male"]]),
"East Polynesian - Female" = full_join(data_list[["East Polynesian - Gout - Female"]],
data_list[["East Polynesian - Control - Female"]]),
"East Polynesian - NP - Male" = full_join(data_list[["East Polynesian - Gout - NP - Male"]],
data_list[["East Polynesian - Control - NP - Male"]]),
"East Polynesian - NP - Female" = full_join(data_list[["East Polynesian - Gout - NP - Female"]],
data_list[["East Polynesian - Control - NP - Female"]]),
"West Polynesian - Male" = full_join(data_list[["West Polynesian - Gout - Male"]],
data_list[["West Polynesian - Control - Male"]]),
"West Polynesian - Female" = full_join(data_list[["West Polynesian - Gout - Female"]],
data_list[["West Polynesian - Control - Female"]]))
# Removing any cohorts with fewer than 20 individuals.
for(i in length(gout_data_list):1){
if(nrow(gout_data_list[[i]]) < 20){
gout_data_list[[i]] <- NULL
}
}
# Producing empty list for containing the model objects.
modlist <- vector("list", length(gout_data_list))
# Producing list containing the identities of each predictor or set of predictors.
prsnames <- list("PRS", "PRS_noABCG2", c("rs2231142", "rs10011796"))
# For each of the cohorts in gout_data_list.
for(i in seq_along(gout_data_list)){
# Making empty vector called covariates.
covariates <- c()
# If the cohort is not from the UK Biobank.
if(!str_detect(names(gout_data_list)[i], "UK Biobank")){
# Adding global PCs as covariates.
covariates <- c(covariates, "Geno.PCVector1", "Geno.PCVector2", "Geno.PCVector3", "Geno.PCVector4", "Geno.PCVector5", "Geno.PCVector6", "Geno.PCVector7", "Geno.PCVector8", "Geno.PCVector9", "Geno.PCVector10")
}
# If the cohort is Polynesian.
if(str_detect(names(gout_data_list)[i], "Polynesian")){
# Adding oceanian PCs as covariates.
covariates <- c(covariates, "Geno.PCVector1_Oc", "Geno.PCVector2_Oc", "Geno.PCVector3_Oc", "Geno.PCVector4_Oc", "Geno.PCVector5_Oc", "Geno.PCVector6_Oc", "Geno.PCVector7_Oc", "Geno.PCVector8_Oc", "Geno.PCVector9_Oc", "Geno.PCVector10_Oc")
}
# Making a second covariate vector with age at recruitment.
covariates2 <- c(covariates, "AGECOL")
# Making an empty list that is twice the length of prsnames.
tmplist <- vector("list", 2 * length(prsnames))
# For each of the predictors in prsnames.
for(j in seq_along(prsnames)) {
# Setting the full vector of variables to be the predictor and all covariates.
variables <- c(prsnames[[j]], covariates)
# Making a formula object with gout as the outcome and the variables as predictors.
f <- as.formula(paste("GOUT", paste(variables, collapse = " + "), sep = " ~ "))
# Creating a logistic model object based on this formula.
assign(paste0("Model_", i, "_", j), glm(f, family = binomial, data = gout_data_list[[i]]))
# Creating the first part of the output vector with Cohort, N, N case, N control, and Outcome values.
modstring1 <- c(names(gout_data_list)[[i]],
nrow(gout_data_list[[i]]),
nrow(gout_data_list[[i]] %>% filter(GOUT)),
nrow(gout_data_list[[i]] %>% filter(!GOUT)),
"Gout")
# Making empty list which will have each element being the model output vector.
modstring <- list()
# If the current predictor name contains the string "PRS".
if(any(str_detect(prsnames[[j]], "PRS"))) {
# Making the complete model output vector with modstring1 alongside the following columns: Predictors, Predictor, Covariates, OR, LCL, UCL, Pval, log-odds, standard error.
modstring[[1]] <- c(modstring1,
paste(prsnames[[j]], collapse = " + "),
paste(prsnames[[j]], collapse = " + "),
paste(covariates, collapse = " + "),
OR(get(paste0("Model_", i, "_", j)), prsnames[[j]]),
LCL_OR(get(paste0("Model_", i, "_", j)), prsnames[[j]]),
UCL_OR(get(paste0("Model_", i, "_", j)), prsnames[[j]]),
Pval(get(paste0("Model_", i, "_", j)), prsnames[[j]]),
coef(get(paste0("Model_", i, "_", j)))[[prsnames[[j]]]],
summary(get(paste0("Model_", i, "_", j)))$coefficients[prsnames[[j]], 2])
} else{
# If the current predictor name doesn't contain PRS (i.e. if it is the two ABCG2 variants), do the following to each element of the list.
for(k in 1:length(prsnames[[j]])){
# Making the complete model output vector with modstring1 alongside the following columns: Predictors, Predictor, Covariates, OR, LCL, UCL, Pval, log-odds, standard error.
modstring[[k]] <- c(modstring1,
paste(prsnames[[j]], collapse = " + "),
paste(prsnames[[j]][k], collapse = " + "),
paste(covariates, collapse = " + "),
OR(get(paste0("Model_", i, "_", j)), prsnames[[j]][k]),
LCL_OR(get(paste0("Model_", i, "_", j)), prsnames[[j]][k]),
UCL_OR(get(paste0("Model_", i, "_", j)), prsnames[[j]][k]),
Pval(get(paste0("Model_", i, "_", j)), prsnames[[j]][k]),
coef(get(paste0("Model_", i, "_", j)))[[prsnames[[j]][k]]],
summary(get(paste0("Model_", i, "_", j)))$coefficients[prsnames[[j]][k], 2])
}
}
# Adding the entire modstring to the jth position of tmplist.
tmplist[[j]] <- modstring
# Now doing all of the above but with adjustment for age at recruitment, setting the full vector of variables to be the predictor and all covariates.
variables <- c(prsnames[[j]], covariates2)
# Making a formula object with gout as the outcome and the variables as predictors.
f <- as.formula(paste("GOUT", paste(variables, collapse = " + "), sep = " ~ "))
# Creating a logistic model object based on this formula.
assign(paste0("Model_", i, "_", j, "_adj"), glm(f, family = binomial, data = gout_data_list[[i]]))
# Creating the first part of the output vector with Cohort, N, N case, N control, and Outcome values.
modstring1 <- c(names(gout_data_list)[[i]],
nrow(gout_data_list[[i]]),
nrow(gout_data_list[[i]] %>% filter(GOUT)),
nrow(gout_data_list[[i]] %>% filter(!GOUT)),
"Gout")
# Making empty list which will have each element being the model output vector.
modstring <- list()
# If the current predictor name contains the string "PRS".
if(any(str_detect(prsnames[[j]], "PRS"))) {
# Making the complete model output vector with modstring1 alongside the following columns: Predictors, Predictor, Covariates, OR, LCL, UCL, Pval, log-odds, standard error.
modstring[[1]] <- c(modstring1,
paste(prsnames[[j]], collapse = " + "),
paste(prsnames[[j]], collapse = " + "),
paste(covariates2, collapse = " + "),
OR(get(paste0("Model_", i, "_", j, "_adj")), prsnames[[j]]),
LCL_OR(get(paste0("Model_", i, "_", j, "_adj")), prsnames[[j]]),
UCL_OR(get(paste0("Model_", i, "_", j, "_adj")), prsnames[[j]]),
Pval(get(paste0("Model_", i, "_", j, "_adj")), prsnames[[j]]),
coef(get(paste0("Model_", i, "_", j, "_adj")))[[prsnames[[j]]]],
summary(get(paste0("Model_", i, "_", j, "_adj")))$coefficients[prsnames[[j]], 2])
} else{
# If the current predictor name doesn't contain PRS (i.e. if it is the two ABCG2 variants), do the following to each element of the list.
for(k in 1:length(prsnames[[j]])){
# Making the complete model output vector with modstring1 alongside the following columns: Predictors, Predictor, Covariates, OR, LCL, UCL, Pval, log-odds, standard error.
modstring[[k]] <- c(modstring1,
paste(prsnames[[j]], collapse = " + "),
paste(prsnames[[j]][k], collapse = " + "),
paste(covariates2, collapse = " + "),
OR(get(paste0("Model_", i, "_", j, "_adj")), prsnames[[j]][k]),
LCL_OR(get(paste0("Model_", i, "_", j, "_adj")), prsnames[[j]][k]),
UCL_OR(get(paste0("Model_", i, "_", j, "_adj")), prsnames[[j]][k]),
Pval(get(paste0("Model_", i, "_", j, "_adj")), prsnames[[j]][k]),
coef(get(paste0("Model_", i, "_", j, "_adj")))[[prsnames[[j]][k]]],
summary(get(paste0("Model_", i, "_", j, "_adj")))$coefficients[prsnames[[j]][k], 2])
}
}
# Adding the remainder of the modstring to tmplist.
tmplist[[j + length(prsnames)]] <- modstring
}
# Adding all model summaries for each cohort to the modlist list.
modlist[[i]] <- tmplist
}
# Cleaning up.
rm(list = ls()[str_detect(ls(), "Model_")])
# Collapsing the lists to create a dataframe with all results for gout models.
tmp <- modlist %>%
flatten() %>%
flatten() %>%
as.data.frame() %>%
data.table::transpose()
# Setting the column names of the dataframe.
colnames(tmp) <- c("Cohort", "N", "N case", "N control", "Outcome", "Predictors", "Predictor", "Covariates", "OR", "LCL", "UCL", "Pval", "log-odds", "SE")
# Setting the column types.
GoutModels <- tmp %>%
mutate(across(c(Cohort, Outcome, Predictors, Predictor, Covariates), factor),
across(c(N, `N case`, `N control`, OR, LCL, UCL, Pval, `log-odds`, SE), as.numeric))
# Saving the goutmodels object.
save(GoutModels, file = here("Output/GoutModels.RData"))
# Cleaning up.
rm(modlist, tmp, tmplist, covariates, covariates2, f, i, j, k, modstring, modstring1, variables, gout_data_list)
Severity Models
Next, we model the PRS, the PRS without ABCG2, and both ABCG2 variants simultaneously against age at onset and tophaceous disease, adjusting for global PCs. It is done in males and females separately, and is both adjusted for disease duration (for tophaceous disease) and unadjusted. It is also run with the 10 oceanian PCs for Polynesian cohorts.
# Preparing the list of cohorts for age at onset models.
onset_data_list <- list("Aus/NZ European - Male" = data_list[["Aus/NZ European - Gout - Male"]] %>% filter(!is.na(AGE1ATK)),
"Aus/NZ European - Female" = data_list[["Aus/NZ European - Gout - Female"]] %>% filter(!is.na(AGE1ATK)),
"GlobalGout - Male" = data_list[["GlobalGout - Gout - Male"]] %>% filter(!is.na(AGE1ATK)),
"GlobalGout - Female" = data_list[["GlobalGout - Gout - Female"]] %>% filter(!is.na(AGE1ATK)),
"Ardea - LASSO - Male" = data_list[["Ardea - LASSO - Male"]] %>% filter(!is.na(AGE1ATK)),
"Ardea - LASSO - Female" = data_list[["Ardea - LASSO - Female"]] %>% filter(!is.na(AGE1ATK)),
"Ardea - CLEAR1 - Male" = data_list[["Ardea - CLEAR1 - Male"]] %>% filter(!is.na(AGE1ATK)),
"Ardea - CLEAR1 - Female" = data_list[["Ardea - CLEAR1 - Female"]] %>% filter(!is.na(AGE1ATK)),
"Ardea - CLEAR2 - Male" = data_list[["Ardea - CLEAR2 - Male"]] %>% filter(!is.na(AGE1ATK)),
"Ardea - CLEAR2 - Female" = data_list[["Ardea - CLEAR2 - Female"]] %>% filter(!is.na(AGE1ATK)),
"Ardea - CRYSTAL - Male" = data_list[["Ardea - CRYSTAL - Male"]] %>% filter(!is.na(AGE1ATK)),
"Ardea - CRYSTAL - Female" = data_list[["Ardea - CRYSTAL - Female"]] %>% filter(!is.na(AGE1ATK)),
"Ardea - LIGHT - Male" = data_list[["Ardea - LIGHT - Male"]] %>% filter(!is.na(AGE1ATK)),
"Ardea - LIGHT - Female" = data_list[["Ardea - LIGHT - Female"]] %>% filter(!is.na(AGE1ATK)),
"East Polynesian - Male" = data_list[["East Polynesian - Gout - Male"]] %>% filter(!is.na(AGE1ATK)),
"East Polynesian - Female" = data_list[["East Polynesian - Gout - Female"]] %>% filter(!is.na(AGE1ATK)),
"East Polynesian - NP - Male" = data_list[["East Polynesian - Gout - NP - Male"]] %>% filter(!is.na(AGE1ATK)),
"East Polynesian - NP - Female" = data_list[["East Polynesian - Gout - NP - Female"]] %>% filter(!is.na(AGE1ATK)),
"West Polynesian - Male" = data_list[["West Polynesian - Gout - Male"]] %>% filter(!is.na(AGE1ATK)),
"West Polynesian - Female" = data_list[["West Polynesian - Gout - Female"]] %>% filter(!is.na(AGE1ATK)))
# Removing any cohorts with fewer than 20 individuals.
for(i in length(onset_data_list):1){
if(nrow(onset_data_list[[i]]) < 20){
onset_data_list[[i]] <- NULL
}
}
# Producing empty list for containing the model objects.
modlist <- vector("list", length(onset_data_list))
# For each of the cohorts in onset_data_list.
for(i in seq_along(onset_data_list)){
# Making empty vector called covariates.
covariates <- c()
# If the cohort is not from the UK Biobank.
if(!str_detect(names(onset_data_list)[i], "UK Biobank")){
# Adding global PCs as covariates.
covariates <- c(covariates, "Geno.PCVector1", "Geno.PCVector2", "Geno.PCVector3", "Geno.PCVector4", "Geno.PCVector5", "Geno.PCVector6", "Geno.PCVector7", "Geno.PCVector8", "Geno.PCVector9", "Geno.PCVector10")
}
# If the cohort is Polynesian.
if(str_detect(names(onset_data_list)[i], "Polynesian")){
# Adding oceanian PCs as covariates.
covariates <- c(covariates, "Geno.PCVector1_Oc", "Geno.PCVector2_Oc", "Geno.PCVector3_Oc", "Geno.PCVector4_Oc", "Geno.PCVector5_Oc", "Geno.PCVector6_Oc", "Geno.PCVector7_Oc", "Geno.PCVector8_Oc", "Geno.PCVector9_Oc", "Geno.PCVector10_Oc")
}
# Making an empty list that is the length of prsnames.
tmplist <- vector("list", length(prsnames))
# For each of the predictors in prsnames.
for(j in seq_along(prsnames)) {
# Setting the full vector of variables to be the predictor and all covariates.
variables <- c(prsnames[[j]], covariates)
# Making a formula object with age at onset as the outcome and the variables as predictors.
f <- as.formula(paste("AGE1ATK", paste(variables, collapse = " + "), sep = " ~ "))
# Creating a linear model object based on this formula.
assign(paste0("Model_", i, "_", j), lm(f, data = onset_data_list[[i]]))
# Creating the first part of the output vector with Cohort, N, and Outcome values.
modstring1 <- c(names(onset_data_list)[[i]],
nrow(onset_data_list[[i]]),
"Age at Onset (years)")
# Making empty list which will have each element being the model output vector.
modstring <- list()
# If the current predictor name contains the string "PRS".
if(any(str_detect(prsnames[[j]], "PRS"))) {
# Making the complete model output vector with modstring1 alongside the following columns: Predictors, Predictor, Covariates, Beta, LCL, UCL, Pval, standard error.
modstring[[1]] <- c(modstring1,
paste(prsnames[[j]], collapse = " + "),
paste(prsnames[[j]], collapse = " + "),
paste(covariates, collapse = " + "),
Beta(get(paste0("Model_", i, "_", j)), prsnames[[j]]),
LCL(get(paste0("Model_", i, "_", j)), prsnames[[j]]),
UCL(get(paste0("Model_", i, "_", j)), prsnames[[j]]),
Pval(get(paste0("Model_", i, "_", j)), prsnames[[j]]),
summary(get(paste0("Model_", i, "_", j)))$coefficients[prsnames[[j]], 2])
} else{
# If the current predictor name doesn't contain PRS (i.e. if it is the two ABCG2 variants), do the following to each element of the list.
for(k in 1:length(prsnames[[j]])){
# Making the complete model output vector with modstring1 alongside the following columns: Predictors, Predictor, Covariates, Beta, LCL, UCL, Pval, standard error.
modstring[[k]] <- c(modstring1,
paste(prsnames[[j]], collapse = " + "),
paste(prsnames[[j]][k], collapse = " + "),
paste(covariates, collapse = " + "),
Beta(get(paste0("Model_", i, "_", j)), prsnames[[j]][k]),
LCL(get(paste0("Model_", i, "_", j)), prsnames[[j]][k]),
UCL(get(paste0("Model_", i, "_", j)), prsnames[[j]][k]),
Pval(get(paste0("Model_", i, "_", j)), prsnames[[j]][k]),
summary(get(paste0("Model_", i, "_", j)))$coefficients[prsnames[[j]][k], 2])
}
}
# Adding the remainder of the modstring to tmplist.
tmplist[[j]] <- modstring
}
# Adding all model summaries for each cohort to the modlist list.
modlist[[i]] <- tmplist
}
# Cleaning up.
rm(list = ls()[str_detect(ls(), "Model_")])
# Collapsing the lists to create a dataframe with all results for onset models.
tmp <- modlist %>%
flatten() %>%
flatten() %>%
as.data.frame() %>%
data.table::transpose()
# Setting the column names of the dataframe.
colnames(tmp) <- c("Cohort", "N", "Outcome", "Predictors", "Predictor", "Covariates", "Beta", "LCL", "UCL", "Pval", "SE")
# Setting the column types.
OnsetModels <- tmp %>%
mutate(across(c(Cohort, Outcome, Predictors, Predictor, Covariates), factor),
across(c(N, Beta, LCL, UCL, Pval, SE), as.numeric))
# Saving the OnsetModels dataframe.
save(OnsetModels, file = here("Output/OnsetModels.RData"))
# Cleaning up.
rm(modlist, tmp, tmplist, covariates, f, i, j, k, modstring, modstring1, variables, onset_data_list)
# Preparing the list of cohorts for tophaceous disease models.
tophi_data_list <- list("Aus/NZ European - Male" = data_list[["Aus/NZ European - Gout - Male"]] %>% filter(!is.na(TOPHIGOUT)),
"Aus/NZ European - Female" = data_list[["Aus/NZ European - Gout - Female"]] %>% filter(!is.na(TOPHIGOUT)),
"GlobalGout - Male" = data_list[["GlobalGout - Gout - Male"]] %>% filter(!is.na(TOPHIGOUT)),
"GlobalGout - Female" = data_list[["GlobalGout - Gout - Female"]] %>% filter(!is.na(TOPHIGOUT)),
"Ardea - LASSO - Male" = data_list[["Ardea - LASSO - Male"]] %>% filter(!is.na(TOPHIGOUT)),
"Ardea - LASSO - Female" = data_list[["Ardea - LASSO - Female"]] %>% filter(!is.na(TOPHIGOUT)),
"Ardea - CLEAR1 - Male" = data_list[["Ardea - CLEAR1 - Male"]] %>% filter(!is.na(TOPHIGOUT)),
"Ardea - CLEAR1 - Female" = data_list[["Ardea - CLEAR1 - Female"]] %>% filter(!is.na(TOPHIGOUT)),
"Ardea - CLEAR2 - Male" = data_list[["Ardea - CLEAR2 - Male"]] %>% filter(!is.na(TOPHIGOUT)),
"Ardea - CLEAR2 - Female" = data_list[["Ardea - CLEAR2 - Female"]] %>% filter(!is.na(TOPHIGOUT)),
"Ardea - LIGHT - Male" = data_list[["Ardea - LIGHT - Male"]] %>% filter(!is.na(TOPHIGOUT)),
"Ardea - LIGHT - Female" = data_list[["Ardea - LIGHT - Female"]] %>% filter(!is.na(TOPHIGOUT)),
"East Polynesian - Male" = rbind(data_list[["East Polynesian - Gout - Male"]] %>% filter(!is.na(TOPHIGOUT)), data_list[["East Polynesian - Gout - NP - Male"]] %>% filter(!is.na(TOPHIGOUT))),
"East Polynesian - Female" = rbind(data_list[["East Polynesian - Gout - Female"]] %>% filter(!is.na(TOPHIGOUT)), data_list[["East Polynesian - Gout - NP - Female"]] %>% filter(!is.na(TOPHIGOUT))),
"West Polynesian - Male" = data_list[["West Polynesian - Gout - Male"]] %>% filter(!is.na(TOPHIGOUT)),
"West Polynesian - Female" = data_list[["West Polynesian - Gout - Female"]] %>% filter(!is.na(TOPHIGOUT)))
# Removing cohorts with fewer than 20 individuals.
for(i in length(tophi_data_list):1){
if(nrow(tophi_data_list[[i]]) < 20){
tophi_data_list[[i]] <- NULL
}
}
# Creating empty list for inputting model summaries.
modlist <- vector("list", length(tophi_data_list))
# For each cohort in tophi_data_list.
for(i in seq_along(tophi_data_list)){
# Making empty vector for covariates.
covariates <- c()
# If the cohort is not UK Biobank.
if(!str_detect(names(tophi_data_list)[i], "UK Biobank")){
# Adding global genetic PCs as covariates.
covariates <- c(covariates, "Geno.PCVector1", "Geno.PCVector2", "Geno.PCVector3", "Geno.PCVector4", "Geno.PCVector5", "Geno.PCVector6", "Geno.PCVector7", "Geno.PCVector8", "Geno.PCVector9", "Geno.PCVector10")
}
# If the cohort is Polynesian.
if(str_detect(names(tophi_data_list)[i], "Polynesian")){
# Adding oceanian genetic PCs as covariates.
covariates <- c(covariates, "Geno.PCVector1_Oc", "Geno.PCVector2_Oc", "Geno.PCVector3_Oc", "Geno.PCVector4_Oc", "Geno.PCVector5_Oc", "Geno.PCVector6_Oc", "Geno.PCVector7_Oc", "Geno.PCVector8_Oc", "Geno.PCVector9_Oc", "Geno.PCVector10_Oc")
}
# Making second list of covariates with disease duration.
covariates2 <- c(covariates, "DURATION")
# Making output list that is twice the number of predictors.
tmplist <- vector("list", 2 * length(prsnames))
# For each predictor.
for(j in seq_along(prsnames)) {
# Making a variables vector with the predictor and covariates.
variables <- c(prsnames[[j]], covariates)
# Preparing a formula object.
f <- as.formula(paste("TOPHIGOUT", paste(variables, collapse = " + "), sep = " ~ "))
# Running a logistic model based on the formula.
assign(paste0("Model_", i, "_", j), glm(f, family = binomial, data = tophi_data_list[[i]]))
# Preparing modstring1 vector with Cohort, N, N case, N control, and Outcome
modstring1 <- c(names(tophi_data_list)[[i]],
nrow(tophi_data_list[[i]]),
nrow(tophi_data_list[[i]] %>% filter(TOPHIGOUT)),
nrow(tophi_data_list[[i]] %>% filter(!TOPHIGOUT)),
"Tophi")
# Making empty list.
modstring <- list()
# If the predictor is a PRS.
if(any(str_detect(prsnames[[j]], "PRS"))) {
# Add Predictors, Predictor, Covariates, OR, LCL, UCL, Pval, log-odds, and standard error columns to the model summary.
modstring[[1]] <- c(modstring1,
paste(prsnames[[j]], collapse = " + "),
paste(prsnames[[j]], collapse = " + "),
paste(covariates, collapse = " + "),
OR(get(paste0("Model_", i, "_", j)), prsnames[[j]]),
LCL_OR(get(paste0("Model_", i, "_", j)), prsnames[[j]]),
UCL_OR(get(paste0("Model_", i, "_", j)), prsnames[[j]]),
Pval(get(paste0("Model_", i, "_", j)), prsnames[[j]]),
coef(get(paste0("Model_", i, "_", j)))[[prsnames[[j]]]],
summary(get(paste0("Model_", i, "_", j)))$coefficients[prsnames[[j]], 2])
} else{
# If the predictor is not a PRS (i.e. if it is the two variants), then for each variant.
for(k in 1:length(prsnames[[j]])){
# Add Predictors, Predictor, Covariates, OR, LCL, UCL, Pval, log-odds, and standard error columns to the model summary.
modstring[[k]] <- c(modstring1,
paste(prsnames[[j]], collapse = " + "),
paste(prsnames[[j]][k], collapse = " + "),
paste(covariates, collapse = " + "),
OR(get(paste0("Model_", i, "_", j)), prsnames[[j]][k]),
LCL_OR(get(paste0("Model_", i, "_", j)), prsnames[[j]][k]),
UCL_OR(get(paste0("Model_", i, "_", j)), prsnames[[j]][k]),
Pval(get(paste0("Model_", i, "_", j)), prsnames[[j]][k]),
coef(get(paste0("Model_", i, "_", j)))[[prsnames[[j]][k]]],
summary(get(paste0("Model_", i, "_", j)))$coefficients[prsnames[[j]][k], 2])
}
}
# Save the modstring object as an element of tmplist.
tmplist[[j]] <- modstring
# Making variables vector.
variables <- c(prsnames[[j]], covariates2)
# Preparing a formula object.
f <- as.formula(paste("TOPHIGOUT", paste(variables, collapse = " + "), sep = " ~ "))
# Running a logistic model based on the formula.
assign(paste0("Model_", i, "_", j), glm(f, family = binomial, data = tophi_data_list[[i]]))
# Preparing modstring1 vector with Cohort, N, N case, N control, and Outcome
modstring1 <- c(names(tophi_data_list)[[i]],
nrow(tophi_data_list[[i]]),
nrow(tophi_data_list[[i]] %>% filter(TOPHIGOUT)),
nrow(tophi_data_list[[i]] %>% filter(!TOPHIGOUT)),
"Tophi")
# Making empty list.
modstring <- list()
# If the predictor is a PRS.
if(any(str_detect(prsnames[[j]], "PRS"))) {
# Add Predictors, Predictor, Covariates, OR, LCL, UCL, Pval, log-odds, and standard error columns to the model summary.
modstring[[1]] <- c(modstring1,
paste(prsnames[[j]], collapse = " + "),
paste(prsnames[[j]], collapse = " + "),
paste(covariates2, collapse = " + "),
OR(get(paste0("Model_", i, "_", j)), prsnames[[j]]),
LCL_OR(get(paste0("Model_", i, "_", j)), prsnames[[j]]),
UCL_OR(get(paste0("Model_", i, "_", j)), prsnames[[j]]),
Pval(get(paste0("Model_", i, "_", j)), prsnames[[j]]),
coef(get(paste0("Model_", i, "_", j)))[[prsnames[[j]]]],
summary(get(paste0("Model_", i, "_", j)))$coefficients[prsnames[[j]], 2])
} else{
# If the predictor is not a PRS (i.e. if it is the two variants), then for each variant.
for(k in 1:length(prsnames[[j]])){
# Add Predictors, Predictor, Covariates, OR, LCL, UCL, Pval, log-odds, and standard error columns to the model summary.
modstring[[k]] <- c(modstring1,
paste(prsnames[[j]], collapse = " + "),
paste(prsnames[[j]][k], collapse = " + "),
paste(covariates2, collapse = " + "),
OR(get(paste0("Model_", i, "_", j)), prsnames[[j]][k]),
LCL_OR(get(paste0("Model_", i, "_", j)), prsnames[[j]][k]),
UCL_OR(get(paste0("Model_", i, "_", j)), prsnames[[j]][k]),
Pval(get(paste0("Model_", i, "_", j)), prsnames[[j]][k]),
coef(get(paste0("Model_", i, "_", j)))[[prsnames[[j]][k]]],
summary(get(paste0("Model_", i, "_", j)))$coefficients[prsnames[[j]][k], 2])
}
}
# Add the model summaries to tmplist.
tmplist[[j + length(prsnames)]] <- modstring
}
# Save the model summary list for each cohort to modlist.
modlist[[i]] <- tmplist
}
# Cleaning up.
rm(list = ls()[str_detect(ls(), "Model_")])
# Collapsing the lists to create a dataframe with all results for tophi models.
tmp <- modlist %>%
flatten() %>%
flatten() %>%
as.data.frame() %>%
data.table::transpose()
# Setting the column names of the dataframe.
colnames(tmp) <- c("Cohort", "N", "N case", "N control", "Outcome", "Predictors", "Predictor", "Covariates", "OR", "LCL", "UCL", "Pval", "log-odds", "SE")
# Setting the column types.
TophiModels <- tmp %>%
mutate(across(c(Cohort, Outcome, Predictors, Predictor, Covariates), factor),
across(c(N, `N case`, `N control`, OR, LCL, UCL, Pval, `log-odds`, SE), as.numeric))
# Saving the TophiModels dataframe.
save(TophiModels, file = here("Output/TophiModels.RData"))
# Cleaning up.
rm(modlist, tmp, tmplist, covariates, covariates2, f, i, j, k, modstring, modstring1, prsnames, variables)
He Replication
Below I investigate the reason for the discordant results in He et al., 2017 compared to our results. They only used the Genetics of Gout in Aotearoa cohort, split into European, East Polynesian and West Polynesian (though Polynesians were also combined). They state they have 1,778 total gout cases, testing associations of SLC2A9 rs11942223, ABCG2 rs2231142, and ABCG2 rs10011796 with tophaceous gout. They show no association of SLC2A9 with tophi. They show a general association of both ABCG2 variants with tophi. They show a Polynesian association of rs2231142 with tophi + a West Polynesian specific effect. They show a West Polynesian specific association of rs10011796 with tophi.
I was able to reproduce their results, though I found that only under the dominance model did the West Polynesian effect of rs10011796 show up. Overall, this was an intriguing biological dominance effect that stratifies by ancestry.
# Loading CoreExome QC 1-10 phenotype file into R (this was made by Tanya).
CoreExPheno <- read_delim("/Volumes/archive/merrimanlab/raid_backup/New_Zealand_Chip_data/CoreExome/QC_MergedBatches/Final_Data/CZ-MB1.2-QC1.10_MergedPhenotypes_20082020.txt", delim = "\t") %>%
mutate(across(where(is_character), factor))
# Loading IDs of individuals genotyped on the CoreExome chip.
All_CoreEx_ID <- read_delim("/Volumes/archive/merrimanlab/raid_backup/New_Zealand_Chip_data/CoreExome/QC_MergedBatches/Final_Data/CZ-MB1.2-QC1.10_CoreExome24-1.0-3_genotyped-QCd_rsIDconverted.fam", delim = " ", col_names = F)
# Making filtered coreexome dataset for replication leaves 2,449 gout cases (2,030 belong to Gout in Aotearoa, with 39 labeled as Gout in Aotearoa (Old)).
CoreExPheno_Filtered <- CoreExPheno %>%
filter(Geno.BroadAncestry %in% c("Oceanian", "European"),
Geno.SampleID %in% All_CoreEx_ID$X2,
General.Use != "No",
Pheno.Study %in% c("Diabetes Mellitus", "Gout in Aotearoa", "Gout in Aotearoa (Old)", "Ngati Porou", "Renal Disease"),
Pheno.GoutSummary == "Gout") %>%
mutate(across(where(is.factor), factor)) %>%
select(Pheno.SampleID:Pheno.UrateTherapy, GenStudio.ChipType, GenStudio.CallRate:Notes)
# Cleaning up.
rm(CoreExPheno, All_CoreEx_ID)
# Making function for modifying boolean variables.
logicfactor <- function(x) {
as.logical(factor(x, levels = c(1, 2), labels = c("FALSE", "TRUE")))
}
# Extracting just gout in aotearoa individuals.
tmp <- CoreExPheno_Filtered %>%
filter(Pheno.Study == "Gout in Aotearoa")
# Adding phenotypes to Gout in Aotearoa samples.
aotearoa_pheno <- read_delim(here("Data/Phenotypes/NZPheno.txt"), delim = "\t", guess_max = 5000) %>%
filter(SUBJECT %in% tmp$Pheno.SampleID) %>%
select(SUBJECT, DATEARR, DOB, AGECOL, DIABETES, FAMGOUT, FAMGOUT3:HIBP, HIBPTREAT:FRUSEMIDE, BUMETANIDE, THIAZIDEDIURETIC:BENDROFLUAZIDE, HCTZ, METOLAZONE, CHLORHALIDONE, INDAPAMIDE, OTHDIURETIC, SPIRONOLACTONE, AMILORIDE, ACETAZOLAMIDE, DIURETICCOMMENT:DIURRECRUIT, LIPIDS, LIPIDLOWER:BILEACIDSEQ, HEART:STROKE, KIDNEY:HEALTHOTH, SUGDRINK, SMOKER:OTHALCO, WEIGHT:HEIGHT, BMI:BMICALC, MRURATE:MRCREATDATE, GOUTCRITERIAB, SUSTOPHUS:DIURGOUT, ALLOPCURRENT, PROBENCURRENT, BENZBROCURRENT, FEBUXCURRENT, OTHULTCURRENT, CURULTCOMMENT:ALLOPINTOLERANCE, ALLOPSIDE, URATEDOX:HIGHESTSUDATE, CHOLES:TRIGLY, SCREAT:SURICACID, URATE1MONTH, RELATEDFILTER:RELATED) %>%
left_join(tmp, by = c("SUBJECT" = "Pheno.SampleID")) %>%
rename(IID = SUBJECT,
GOUT = Pheno.GoutSummary,
SEX = Geno.GeneticSex) %>%
mutate(across(where(is_character), factor),
across(c(FAMGOUT, FAMGOUT3, HIBP, DIURETIC:ACETAZOLAMIDE, LIPIDS:KIDNEY, FATTYLIVER, GOUTCRITERIAB:SUSTOPHUS, TOPHUS, ALLOPCURRENT:OTHULTCURRENT, ALLOPINTOLERANCE),
logicfactor),
AGE1ATK = case_when(!is.na(AGE1ATK) ~ AGE1ATK,
TRUE ~ AGECOL - DURATION),
DURATION = AGECOL - AGE1ATK + 1,
NUMATK = NUMATK,
TOPHIGOUT = GOUTCRITERIAB | SUSTOPHUS | TOPHUS,
EROSIONS = NA,
URATE = case_when(is.na(SURICACID) ~ rowMeans(across(c(MRURATE, URATEDOX, PREULTURATE, HIGHESTSU, URATE1MONTH)), na.rm = TRUE) * 1000 / 59.48,
TRUE ~ SURICACID * 1000 / 59.48),
ULT = ALLOPCURRENT | PROBENCURRENT | BENZBROCURRENT | FEBUXCURRENT | OTHULTCURRENT,
PROPHY = NA,
HEIGHT = HEIGHT / 100,
BMI = case_when(!is.na(BMI) ~ BMI,
TRUE ~ WEIGHT / (HEIGHT * HEIGHT)),
HYPERTENSION = HIBP | !is.na(HIBPTREAT) | DIURETIC | DIURETICCURRENT | LOOPDIURETIC | FRUSEMIDE | BUMETANIDE | THIAZIDEDIURETIC | BENDROFLUAZIDE | HCTZ | METOLAZONE | CHLORHALIDONE | INDAPAMIDE | OTHDIURETIC | SPIRONOLACTONE | AMILORIDE | ACETAZOLAMIDE | !is.na(DIURETICCOMMENT) | DIURRECRUIT == 2 | DIURGOUT %in% 2:4,
DIABETES = DIABETES == 2,
HEART = HEART | ANGINA | HEARTFAILURE | HEARTSURGERY | HEARTATTACK,
CREAT = rowMeans(across(c(SCREAT, MRCREAT))) / 88.42,
EGFR = case_when(SEX == "Male" ~ 175 * (CREAT ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (CREAT ^ -1.154) * (AGECOL ^ -0.203) * 0.742),
KIDNEY = KIDNEY | !is.na(KIDNEY2) | EGFR < 60,
LIPIDS = LIPIDS | LIPIDLOWER | STATIN | FIBRATES | EZETIMIBE | NICOTINICACID | BILEACIDSEQ,
STROKE = STROKE,
TOTALALC = rowSums(across(c(BEER, WINE, SPIRITS)), na.rm = TRUE),
SUGDRINK = SUGDRINK,
CURSMOKE = SMOKER == 2,
FAMGOUT = FAMGOUT,
FAMGOUTNUM = FAMGOUT4) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Further filtering to remove those missing tophaceous gout status leaves 1,662 individuals, which is not far off their numbers, but I will add DM, RD, and NPH data.
test <- aotearoa_pheno %>%
filter(!is.na(TOPHIGOUT))
# Filtering coreexpheno file to only include DM samples.
tmp <- CoreExPheno_Filtered %>%
filter(Pheno.Study == "Diabetes Mellitus")
# Extracting Diabetes Mellitus cohort information.
dm_pheno <- read_delim(here("Data/Phenotypes/DMPheno.txt"), delim = "\t") %>%
filter(PATIENT %in% tmp$Pheno.SampleID) %>%
select(PATIENT, DOB, DATECOL, AGECOL, DIABETES:DIABETESTREAT, FAMGOUT:HIBPTREAT, LIPIDS, HEART:STROKE, KIDNEY:KIDNEY2, SUGDRINK, SMOKER:OTHALCO, WEIGHT, HEIGHT, BMI, URATE:CREAT, DIURETIC:OTHDIURETIC, LIPIDLOWER:BILEACIDSEQ, COMMENT, GOUTCRITERIAB, SUSTOPHUS:OTHDRUG, URATEDOX:DATEDOX, DIABETESAFFSTAT, KIDNEYTRANSPLANT, RENALDISEASE, FASTING:TRIGLY, SURICACID:EGFR) %>%
left_join(tmp, by = c("PATIENT" = "Pheno.SampleID")) %>%
rename(IID = PATIENT,
GOUT = Pheno.GoutSummary,
SEX = Geno.GeneticSex) %>%
mutate(across(where(is_character), factor),
across(c(DIABETES, FAMGOUT, FAMGOUT3, HIBP, LIPIDS, HEART:STROKE, KIDNEY, DIURETIC:OTHDIURETIC, LIPIDLOWER:BILEACIDSEQ, GOUTCRITERIAB, SUSTOPHUS, TOPHUS, ALLOP:COLCHI, DIABETESAFFSTAT, KIDNEYTRANSPLANT, RENALDISEASE), logicfactor),
DURATION = AGECOL - AGE1ATK + 1,
TOPHIGOUT = TOPHUS | GOUTCRITERIAB | SUSTOPHUS,
EROSIONS = NA,
URATE = case_when(!is.na(SURICACID) ~ SURICACID * 1000 / 59.48,
!is.na(URATE) ~ URATE * 1000 / 59.48,
TRUE ~ URATEDOX * 1000 / 59.48),
ULT = ALLOP | PROBEN,
PROPHY = STEROID | ANTIINFLAM | COLCHI | OTHDRUG != "no",
HEIGHT = HEIGHT / 100,
BMI = case_when(!is.na(BMI) ~ BMI,
TRUE ~ WEIGHT / (HEIGHT * HEIGHT)),
HYPERTENSION = HIBP | !is.na(HIBPTREAT) | DIURETIC | LOOPDIURETIC | THIAZIDEDIURETIC | OTHDIURETIC | DIURGOUT,
DIABETES = DIABETES | !is.na(DIABETESTREAT) | DIABETESAFFSTAT,
HEART = HEART | ANGINA | HEARTFAILURE | HEARTSURGERY | HEARTATTACK,
CREAT = CREAT / 88.42,
SCREAT = SCREAT / 88.42,
CREAT2 = rowMeans(across(c(CREAT, SCREAT)), na.rm = T),
EGFR = case_when(SEX == "Male" ~ 175 * (CREAT2 ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (CREAT2 ^ -1.154) * (AGECOL ^ -0.203) * 0.742,
TRUE ~ EGFR),
KIDNEY = KIDNEY | !is.na(KIDNEY2) | EGFR < 60 | KIDNEYTRANSPLANT | RENALDISEASE,
LIPIDS = LIPIDS | LIPIDLOWER | STATIN | FIBRATES | EZETIMIBE | NICOTINICACID | BILEACIDSEQ,
STROKE = STROKE,
TOTALALC = rowSums(across(c(BEER, WINE, SPIRITS)), na.rm = TRUE),
SUGDRINK = SUGDRINK,
CURSMOKE = SMOKER == 2,
FAMGOUT = FAMGOUT | FAMGOUT3,
FAMGOUTNUM = FAMGOUT4) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Filtering coreexpheno file to only include NP samples.
tmp <- CoreExPheno_Filtered %>%
filter(Pheno.Study == "Ngati Porou")
# Extracting Ngati Porou cohort information.
nph_pheno <- read_delim(here("Data/Phenotypes/NPHPheno.txt"), delim = "\t") %>%
filter(PATIENT %in% tmp$Pheno.SampleID) %>%
select(PATIENT, DOB, CONSENT, DATEARR, AGECOL, DIABETES, FAMGOUT:HIBP, LIPIDS, LIPIDLOWER:STROKE, KIDNEY, SUGDRINK, SMOKER:SPIRITS, WEIGHT:HEIGHT, BMI, URATE:CREATDATE, DIURETICCURRENT:FRUSEMIDE, BUMETANIDE, BENDROFLUAZIDE, HCTZ, METOLAZONE, CHLORHALIDONE, SPIRONOLACTONE, AMILORIDE, COMMENT, GOUTCRITERIAB, SUSTOPHUS, AGE1ATK:ALLOP, STEROID:OTHDRUG, URATEDOX:DATEDOX, RENALTRANSPLANT, DIABETESAFFSTAT, SURICACID:SCREAT, DIURETIC:OTHDIURETIC, STATIN:BILEACIDSEQ, URATELOWERING) %>%
left_join(tmp, by = c("PATIENT" = "Pheno.SampleID")) %>%
rename(IID = PATIENT,
GOUT = Pheno.GoutSummary,
SEX = Geno.GeneticSex) %>%
mutate(across(where(is_character), factor),
across(c(DIABETES, FAMGOUT, FAMGOUT3, HIBP:STROKE, KIDNEY, DIURETICCURRENT:AMILORIDE, GOUTCRITERIAB, SUSTOPHUS, TOPHUS, ALLOP:BENZOBROMARONE, RENALTRANSPLANT, DIABETESAFFSTAT, DIURETIC:URATELOWERING),
logicfactor),
AGE1ATK = case_when(!is.na(AGE1ATK) ~ AGE1ATK,
TRUE ~ AGECOL - DURATION),
DURATION = AGECOL - AGE1ATK + 1,
NUMATK = NUMATK,
TOPHIGOUT = GOUTCRITERIAB | SUSTOPHUS | TOPHUS,
EROSIONS = NA,
URATE = case_when(is.na(SURICACID) ~ rowMeans(across(c(URATE, URATEDOX)), na.rm = TRUE) * 1000 / 59.48,
TRUE ~ SURICACID * 1000 / 59.48),
ULT = ALLOP | PROBEN | BENZOBROMARONE | URATELOWERING,
PROPHY = STEROID | ANTIINFLAM | COLCHI,
HEIGHT = HEIGHT / 100,
BMI = WEIGHT / (HEIGHT * HEIGHT),
HYPERTENSION = HIBP | DIURETICCURRENT | FRUSEMIDE | BUMETANIDE | BENDROFLUAZIDE | HCTZ | METOLAZONE | CHLORHALIDONE | SPIRONOLACTONE | AMILORIDE | DIURGOUT %in% 2:4 | DIURETIC | LOOPDIURETIC | THIAZIDEDIURETIC | OTHDIURETIC,
DIABETES = DIABETES | DIABETESAFFSTAT,
HEART = HEART | ANGINA | HEARTFAILURE | HEARTSURGERY | HEARTATTACK,
CREAT = rowMeans(across(c(CREAT, SCREAT)), na.rm = TRUE) / 88.42,
EGFR = case_when(SEX == "Male" ~ 175 * (CREAT ^ -1.154) * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (CREAT ^ -1.154) * (AGECOL ^ -0.203) * 0.742),
KIDNEY = KIDNEY | EGFR < 60 | RENALTRANSPLANT,
LIPIDS = LIPIDS | LIPIDLOWER | STATIN | FIBRATES | EZETIMIBE | NICOTINICACID | BILEACIDSEQ,
STROKE = STROKE,
TOTALALC = rowSums(across(c(BEER, WINE, SPIRITS)), na.rm = TRUE),
SUGDRINK = SUGDRINK,
CURSMOKE = SMOKER == 2,
FAMGOUT = FAMGOUT,
FAMGOUTNUM = FAMGOUT4) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Filtering coreexpheno file to only include RD samples.
tmp <- CoreExPheno_Filtered %>%
filter(Pheno.Study == "Renal Disease")
# Extracting Renal Disease cohort information.
rd_pheno <- read_delim(here("Data/Phenotypes/RDPheno.txt"), delim = "\t") %>%
filter(PATIENT %in% tmp$Pheno.SampleID) %>%
select(PATIENT, DOB, CONSENTDATE, DATECOL, DATEARR, CKDV, RENALTRANSPLANT, DIABETES, FAMGOUT, HYPERTENSION, DYSLIPIDAEMIA, IHD, CVA, CHF, HEALTHOTH:WEIGHT, BMI, SMOKER, SUGDRINK, BEER:SPIRITS, COMMENT, TYPE2D, GOUTCRITERIAB, SUSTOPHUS, AGE1ATK:OTHDRUG, ESSENTIALHYPERT, SURICACID:SCREAT, RCOMMENTS) %>%
left_join(tmp, by = c("PATIENT" = "Pheno.SampleID")) %>%
rename(IID = PATIENT,
GOUT = Pheno.GoutSummary,
SEX = Geno.GeneticSex) %>%
mutate(across(where(is_character), factor),
across(c(RENALTRANSPLANT:CHF, TYPE2D:SUSTOPHUS, TOPHUS, ALLOPURINOL:RASBURICASE),
logicfactor),
AGECOL = AGECOL,
AGE1ATK = AGE1ATK,
DURATION = AGECOL - AGE1ATK + 1,
NUMATK = NA,
TOPHIGOUT = GOUTCRITERIAB | SUSTOPHUS | TOPHUS,
EROSIONS = NA,
URATE = case_when(is.na(SURICACID) ~ rowMeans(across(c(URATEFIRSTREC, URATEDOX, URATERECENT)), na.rm = TRUE) * 1000 / 59.48,
TRUE ~ SURICACID * 1000 / 59.48),
ULT = ALLOPURINOL | PROBEN | RASBURICASE,
PROPHY = STEROID | ANTIINFLAM | COLCHI,
HEIGHT = HEIGHT / 100,
BMI = WEIGHT / (HEIGHT * HEIGHT),
HYPERTENSION = HYPERTENSION | ESSENTIALHYPERT == 1 | DIURGOUT %in% 2:4,
DIABETES = DIABETES | TYPE2D,
HEART = IHD | CHF,
EGFR = case_when(SEX == "Male" ~ 175 * (SCREAT / 88.42) ^ -1.154 * (AGECOL ^ -0.203),
SEX == "Female" ~ 175 * (SCREAT / 88.42) ^ -1.154 * (AGECOL ^ -0.203) * 0.742),
KIDNEY = CKDV == 1 | RENALTRANSPLANT | EGFR < 60,
LIPIDS = DYSLIPIDAEMIA,
STROKE = CVA,
TOTALALC = rowSums(across(c(BEER, WINE, SPIRITS)), na.rm = TRUE),
SUGDRINK = SUGDRINK,
CURSMOKE = SMOKER == 2,
FAMGOUT = FAMGOUT,
FAMGOUTNUM = NA) %>%
select(IID, GOUT, AGECOL, SEX, Geno.PCVector1:Geno.PCVector10, Geno.PCVector1_Oc:Geno.PCVector10_Oc, AGE1ATK, DURATION, TOPHIGOUT, EROSIONS, NUMATK, URATE, ULT, PROPHY, Geno.SpecificAncestry, BMI, HYPERTENSION, DIABETES, HEART, KIDNEY, LIPIDS, STROKE, TOTALALC, SUGDRINK, CURSMOKE, FAMGOUT, FAMGOUTNUM, Pheno.Study, Geno.SampleID)
# Combining phenotype files together.
combined_pheno <- rbind(aotearoa_pheno, dm_pheno, nph_pheno, rd_pheno) %>%
mutate(Pheno.Study = factor(Pheno.Study)) %>%
arrange(IID) %>%
filter(!(duplicated(IID) | duplicated(IID, fromLast = TRUE)))
# Removing those missing tophaceous gout data leaves 1,875 individuals which is close to their 1,778 individuals.
combined_clean <- combined_pheno %>%
filter(!is.na(TOPHIGOUT))
# 39.8% tophaceous, which isn't far off their 35.3%.
summary(combined_clean$TOPHIGOUT)
# Loading gout and urate GWAS summary tables.
load(here("Output/UKBB_Gene_OR.RData"))
load(here("Output/Tin_Gene_OR.RData"))
# Making combined variant list with all 90 unique lead variants for both GWAS (corresponding to the PLINK files).
Combined_Gene_OR <- UKBB_Gene_OR %>%
select(CHR, BP, RSID, Effect_Allele, Alternate_Allele) %>%
rbind(Tin_Gene_OR %>% select(CHR, BP, RSID, Effect_Allele, Alternate_Allele)) %>%
arrange(CHR, BP) %>%
unique()
# Reading in PLINK map file for the CoreExome PLINK files with all 90 variants from the combination of the Gout PRS and Tin PRS variant lists.
map <- read_delim(path(scratch_path, "/Output/Temp/SNPs.map"),
delim = "\t",
col_names = FALSE)
# Reading in corresponding PLINK ped file.
x <- read_delim(path(scratch_path, "/Output/Temp/SNPs.ped"),
delim = " ",
col_names = FALSE,
col_types = cols(.default = col_character()))
# Renaming columns of x based on map file.
colnames(x)[1:6] <- c("FID", "IID", "PID", "MID", "SEX", "AFF")
colnames(x)[seq(from = 7, to = ncol(x) - 1, by = 2)] <- str_c(map$X2, "_1")
colnames(x)[seq(from = 8, to = ncol(x), by = 2)] <- str_c(map$X2, "_2")
# Filtering x to only include individuals that are in the combined_clean table.
x <- x %>%
filter(IID %in% (combined_clean$Geno.SampleID))
# Setting the number of columns of x as a variable.
num_cols <- ncol(x)
# Converting character genotypes into numeric genotypes based on risk allele = 1.
# For each variant in Combined_Gene_OR (90 total).
for(i in 1:nrow(Combined_Gene_OR)){
# Modifying the column at 2 * i + 5 (i.e. the column corresponding to the first allele of that variant) to first replace any 0 with NA, then replace the Effect Allele with a 1 and the Alternate allele with a 0. Finally, converting the column to a numeric column.
x[[2 * i + 5]] <- x[[2 * i + 5]] %>%
str_replace("0", NA_character_) %>%
str_replace(Combined_Gene_OR[[i, "Effect_Allele"]], "1") %>%
str_replace(Combined_Gene_OR[[i, "Alternate_Allele"]], "0") %>%
as.numeric()
# Doing the same for the column at 2 * i + 6 (i.e. the column corresponding to the second allele of that variant).
x[[2 * i + 6]] <- x[[2 * i + 6]] %>%
str_replace("0", NA_character_) %>%
str_replace(Combined_Gene_OR[[i, "Effect_Allele"]], "1") %>%
str_replace(Combined_Gene_OR[[i, "Alternate_Allele"]], "0") %>%
as.numeric()
# Making a temporary column that is the sum of both numeric alleles.
x <- x %>%
mutate("TEMP" = (x[[2 * i + 5]] + x[[2 * i + 6]]))
# Renaming the column names such that the temporary column is renamed to the corresponding RSID.
colnames(x) <- c(colnames(x[1:((num_cols - 1) + i)]), Combined_Gene_OR[[i, "RSID"]])
}
# Subsetting the columns to only include the IID column and the newly defined numeric genotype columns.
x <- x %>%
select(2, (num_cols + 1):ncol(x))
# Merging this with the phenotype file for further analysis.
combined_clean_prs <- combined_clean %>%
full_join(x, by = c("Geno.SampleID" = "IID"))
# Modelling SLC2A9 rs4481233 (proxy for SLC2A9 variant rs11942223) vs tophaceous disease. OR = 1.17, p = 0.23.
summary(glm(TOPHIGOUT ~ rs4481233, family = "binomial", data = combined_clean_prs))
# Modelling ABCG2 rs2231142 vs tophaceous disease. OR = 1.34, p = 8e-5.
summary(glm(TOPHIGOUT ~ rs2231142, family = "binomial", data = combined_clean_prs))
# Modelling ABCG2 rs10011796 vs tophaceous disease. OR = 1.06, p = 0.38.
summary(glm(TOPHIGOUT ~ rs10011796, family = "binomial", data = combined_clean_prs))
# Modelling both ABCG2 variants vs tophaceous disease. No change from above.
summary(glm(TOPHIGOUT ~ rs2231142 + rs10011796, family = "binomial", data = combined_clean_prs))
# Combining 2's into just 1's so the SNPs are just presence/absence of risk allele (i.e. dominance model).
tmp <- combined_clean_prs %>%
mutate(rs2231142 = case_when(rs2231142 %in% c(1, 2) ~ 1,
TRUE ~ rs2231142),
rs10011796 = case_when(rs10011796 %in% c(1, 2) ~ 1,
TRUE ~ rs10011796))
# Modelling ABCG2 rs2231142 vs tophaceous disease. OR = 1.51, p = 2e-5.
summary(glm(TOPHIGOUT ~ rs2231142, family = "binomial", data = tmp))
# Modelling ABCG2 rs10011796 vs tophaceous disease. OR = 1.30, p = 0.037.
summary(glm(TOPHIGOUT ~ rs10011796, family = "binomial", data = tmp))
# Modelling both ABCG2 variants vs tophaceous disease. Removes association of rs10011796, dampens rs2231142 effect.
summary(glm(TOPHIGOUT ~ rs2231142 + rs10011796, family = "binomial", data = tmp))
# Repeating using Pearson's chi-square test, first making contingency tables.
conting_table1 <- xtabs(~ TOPHIGOUT + rs2231142, data = tmp)
conting_table2 <- xtabs(~ TOPHIGOUT + rs10011796, data = tmp)
# Now running crosstable analysis, identical to logistic regression.
CrossTable(x = conting_table1,
chisq = TRUE,
expected = TRUE,
fisher = TRUE,
resid = TRUE,
sresid = TRUE,
format = "SPSS")
# Again running crosstable analysis, identical to logistic regression.
CrossTable(x = conting_table2,
chisq = TRUE,
expected = TRUE,
fisher = TRUE,
resid = TRUE,
sresid = TRUE,
format = "SPSS")
# Stratifying cohort into Polynesian ancestry only.
tmp <- combined_clean_prs %>%
mutate(rs2231142 = case_when(rs2231142 %in% c(1, 2) ~ 1,
TRUE ~ rs2231142),
rs10011796 = case_when(rs10011796 %in% c(1, 2) ~ 1,
TRUE ~ rs10011796)) %>%
filter(Geno.SpecificAncestry != "European")
# Modelling ABCG2 rs2231142 vs tophaceous disease. OR = 2.01, p = 1.4e-7.
summary(glm(TOPHIGOUT ~ rs2231142, family = "binomial", data = tmp))
# Modelling ABCG2 rs10011796 vs tophaceous disease. OR = 1.27, p = 0.222.
summary(glm(TOPHIGOUT ~ rs10011796, family = "binomial", data = tmp))
# Modelling both ABCG2 variants vs tophaceous disease. No change.
summary(glm(TOPHIGOUT ~ rs2231142 + rs10011796, family = "binomial", data = tmp))
# Extracting West Polynesian data.
tmp <- combined_clean_prs %>%
mutate(rs2231142 = case_when(rs2231142 %in% c(1, 2) ~ 1,
TRUE ~ rs2231142),
rs10011796 = case_when(rs10011796 %in% c(1, 2) ~ 1,
TRUE ~ rs10011796)) %>%
filter(Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan"))
# Modelling ABCG2 rs2231142 vs tophaceous disease. OR = 2.20, p = 0.00046.
summary(glm(TOPHIGOUT ~ rs2231142, family = "binomial", data = tmp))
# Modelling ABCG2 rs10011796 vs tophaceous disease. OR = 3.29, p = 0.002.
summary(glm(TOPHIGOUT ~ rs10011796, family = "binomial", data = tmp))
# Modelling both ABCG2 variants vs tophaceous disease. Minor change in both.
summary(glm(TOPHIGOUT ~ rs2231142 + rs10011796, family = "binomial", data = tmp))
# Extracting East Polynesian data.
tmp <- combined_clean_prs %>%
mutate(rs2231142 = case_when(rs2231142 %in% c(1, 2) ~ 1,
TRUE ~ rs2231142),
rs10011796 = case_when(rs10011796 %in% c(1, 2) ~ 1,
TRUE ~ rs10011796)) %>%
filter(Geno.SpecificAncestry %in% c("East Polynesian", "East-West Polynesian"))
# Modelling ABCG2 rs2231142 vs tophaceous disease. OR = 1.74, p = 0.011.
summary(glm(TOPHIGOUT ~ rs2231142, family = "binomial", data = tmp))
# Modelling ABCG2 rs10011796 vs tophaceous disease. OR = 0.75, p = 0.23.
summary(glm(TOPHIGOUT ~ rs10011796, family = "binomial", data = tmp))
# Modelling both ABCG2 variants vs tophaceous disease. Minor change in both.
summary(glm(TOPHIGOUT ~ rs2231142 + rs10011796, family = "binomial", data = tmp))
# Given the above mostly replicates their results, repeating with additive model.
# Extracting West Polynesian data.
tmp <- combined_clean_prs %>%
filter(Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan"))
# Modelling ABCG2 rs2231142 vs tophaceous disease. OR = 1.52, p = 0.0032.
summary(glm(TOPHIGOUT ~ rs2231142, family = "binomial", data = tmp))
# Modelling ABCG2 rs10011796 vs tophaceous disease. OR = 1.28, p = 0.093.
summary(glm(TOPHIGOUT ~ rs10011796, family = "binomial", data = tmp))
# Modelling both ABCG2 variants vs tophaceous disease. Minor change in both.
summary(glm(TOPHIGOUT ~ rs2231142 + rs10011796, family = "binomial", data = tmp))
# Given the difference between model types, plotting relationships for ABCG2 rs2231142, showing strong dominance effect in West Polynesians and additive effect in East Polynesians.
combined_clean_prs %>%
mutate(ANCESTRY = factor(case_when(Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan") ~ "West Polynesian",
Geno.SpecificAncestry %in% c("East Polynesian", "East-West Polynesian") ~ "East Polynesian",
Geno.SpecificAncestry %in% c("European") ~ "European"), levels = c("European", "East Polynesian", "West Polynesian"))) %>%
group_by(rs2231142, ANCESTRY) %>%
summarize(value = mean(TOPHIGOUT)) %>%
ggplot(aes(x = rs2231142, y = value)) +
geom_bar(stat = "identity", fill = "darkgreen") +
labs(x = "Number of ABCG2 rs2231142 gout risk alleles", y = "Proportion with tophi") +
scale_y_continuous(limits = c(0, 1)) +
facet_wrap(~ ANCESTRY) +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank())
# Plotting relationships for ABCG2 rs10011796, showing strong dominance effect in West Polynesians.
combined_clean_prs %>%
mutate(ANCESTRY = factor(case_when(Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan") ~ "West Polynesian",
Geno.SpecificAncestry %in% c("East Polynesian", "East-West Polynesian") ~ "East Polynesian",
Geno.SpecificAncestry %in% c("European") ~ "European"), levels = c("European", "East Polynesian", "West Polynesian"))) %>%
group_by(rs10011796, ANCESTRY) %>%
summarize(value = mean(TOPHIGOUT)) %>%
ggplot(aes(x = rs10011796, y = value)) +
geom_bar(stat = "identity", fill = "darkgreen") +
labs(x = "Number of ABCG2 rs10011796 gout risk alleles", y = "Proportion with tophi") +
scale_y_continuous(limits = c(0, 1)) +
facet_wrap(~ ANCESTRY) +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank())
# Plotting age at onset as outcome, no evidence of dominance effect.
combined_clean_prs %>%
mutate(ANCESTRY = factor(case_when(Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan") ~ "West Polynesian",
Geno.SpecificAncestry %in% c("East Polynesian", "East-West Polynesian") ~ "East Polynesian",
Geno.SpecificAncestry %in% c("European") ~ "European"), levels = c("European", "East Polynesian", "West Polynesian"))) %>%
group_by(rs2231142, ANCESTRY) %>%
summarize(value = mean(AGE1ATK, na.rm = T)) %>%
ggplot(aes(x = rs2231142, y = value)) +
geom_bar(stat = "identity", fill = "darkgreen") +
labs(x = "Number of ABCG2 rs2231142 gout risk alleles", y = "Mean age at onset") +
facet_wrap(~ ANCESTRY) +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank())
# Again plotting age at onset as outcome, no evidence of dominance effect.
combined_clean_prs %>%
mutate(ANCESTRY = factor(case_when(Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan") ~ "West Polynesian",
Geno.SpecificAncestry %in% c("East Polynesian", "East-West Polynesian") ~ "East Polynesian",
Geno.SpecificAncestry %in% c("European") ~ "European"), levels = c("European", "East Polynesian", "West Polynesian"))) %>%
group_by(rs10011796, ANCESTRY) %>%
summarize(value = mean(AGE1ATK, na.rm = T)) %>%
ggplot(aes(x = rs10011796, y = value)) +
geom_bar(stat = "identity", fill = "darkgreen") +
labs(x = "Number of ABCG2 rs10011796 gout risk alleles", y = "Mean age at onset") +
facet_wrap(~ ANCESTRY) +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank())
Making Plots and Tables for the Final Manuscript
The following code details the creation of the final tables and plots that were included in the manuscript (including supplementary material).
These include the following figures:
Figure 1 = Forest plot(s) of genetic effect on age at onset, A = full PRS model, B = PRS less ABCG2.
Figure 2 = Effect of individual ABCG2 variants, A = rs2231142, B = rs10011796.
Figure 3 = Forest plot of PRS effect on presence of tophaceous gout (two panels, before and after adjustment for disease duration).
Supp Figure 1 = Manhattan plot for gout GWAS.
Supp Figure 2 = Tophaceous disease ABCG2 models and PRS without ABCG2.
Supp Figure 3 = Relationship between flares and PRS (plot as faceted scatter plot, and/or as grouped category plots).
And the following tables:
Table 1 = Gout cohort statistics/demographics (only key variables).
Supp Table 1 = Table describing GWAS results and which SNPs are part of the PRS.
Supp Table 2 = Missing data.
Supp Table 3 = Full cohort stats.
Supp Table 4 = All model results in table format.
There are also several other analyses included at the end of the document.
# Datasets
load(path(scratch_path, "Output/all_pheno_prs.RData"))
# Making a categorical flare frequency variable (FLARE_CAT) and setting all control gout severity traits to NA, then preparing the data further for models.
all_pheno_prs <- all_pheno_prs %>%
mutate(FLARE_CAT = factor(case_when(between(NUMATK, 0, 5) ~ paste0(as.character(NUMATK), " flares in last year"),
between(NUMATK, 6, 11) ~ "One every one to two months",
between(NUMATK, 12, 52) ~ "One or more per month"),
levels = c(paste0(0:5, " flares in last year"),
"One every one to two months",
"One or more per month"),
labels = c(paste0(0:5),
"6 - 11",
"12 - 52"),
ordered = TRUE),
AGE1ATK = case_when(GOUT ~ AGE1ATK),
DURATION = case_when(GOUT ~ DURATION),
NUMATK = case_when(GOUT ~ NUMATK),
TOPHIGOUT = case_when(GOUT ~ TOPHIGOUT),
ULT = case_when(GOUT ~ ULT),
SEX = factor(SEX, levels = c("Male", "Female")),
GOUT2 = factor(GOUT, levels = c(TRUE, FALSE), labels = c("Gout", "Control")),
ANCESTRY_GOUT = factor(case_when(GOUT & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "European Gout",
!GOUT & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "European Control",
GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study != "Ngati Porou" ~ "East Polynesian Gout",
!GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study != "Ngati Porou" ~ "East Polynesian Control",
GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study == "Ngati Porou" ~ "East Polynesian Gout - NP",
!GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study == "Ngati Porou" ~ "East Polynesian Control - NP",
GOUT & Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan") ~ "West Polynesian Gout",
!GOUT & Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan") ~ "West Polynesian Control"),
levels = c("European Gout",
"European Control",
"East Polynesian Gout",
"East Polynesian Control",
"East Polynesian Gout - NP",
"East Polynesian Control - NP",
"West Polynesian Gout",
"West Polynesian Control")),
ANCESTRY_GOUT_SEX = factor(case_when(ANCESTRY_GOUT == "European Gout" & SEX == "Male" ~ "European Gout - male",
ANCESTRY_GOUT == "European Gout" & SEX == "Female" ~ "European Gout - female",
ANCESTRY_GOUT == "European Control" & SEX == "Male" ~ "European Control - male",
ANCESTRY_GOUT == "European Control" & SEX == "Female" ~ "European Control - female",
ANCESTRY_GOUT == "East Polynesian Gout" & SEX == "Male" ~ "East Polynesian Gout - male",
ANCESTRY_GOUT == "East Polynesian Gout" & SEX == "Female" ~ "East Polynesian Gout - female",
ANCESTRY_GOUT == "East Polynesian Control" & SEX == "Male" ~ "East Polynesian Control - male",
ANCESTRY_GOUT == "East Polynesian Control" & SEX == "Female" ~ "East Polynesian Control - female",
ANCESTRY_GOUT == "East Polynesian Gout - NP" & SEX == "Male" ~ "East Polynesian Gout - NP - male",
ANCESTRY_GOUT == "East Polynesian Gout - NP" & SEX == "Female" ~ "East Polynesian Gout - NP - female",
ANCESTRY_GOUT == "East Polynesian Control - NP" & SEX == "Male" ~ "East Polynesian Control - NP - male",
ANCESTRY_GOUT == "East Polynesian Control - NP" & SEX == "Female" ~ "East Polynesian Control - NP - female",
ANCESTRY_GOUT == "West Polynesian Gout" & SEX == "Male" ~ "West Polynesian Gout - male",
ANCESTRY_GOUT == "West Polynesian Gout" & SEX == "Female" ~ "West Polynesian Gout - female",
ANCESTRY_GOUT == "West Polynesian Control" & SEX == "Male" ~ "West Polynesian Control - male",
ANCESTRY_GOUT == "West Polynesian Control" & SEX == "Female" ~ "West Polynesian Control - female"),
levels = c("European Gout - male",
"European Gout - female",
"European Control - male",
"European Control - female",
"East Polynesian Gout - male",
"East Polynesian Gout - female",
"East Polynesian Control - male",
"East Polynesian Control - female",
"East Polynesian Gout - NP - male",
"East Polynesian Gout - NP - female",
"East Polynesian Control - NP - male",
"East Polynesian Control - NP - female",
"West Polynesian Gout - male",
"West Polynesian Gout - female",
"West Polynesian Control - male",
"West Polynesian Control - female")),
SEX_GOUT = factor(case_when(GOUT & SEX == "Male" ~ "Male Gout",
GOUT & SEX == "Female" ~ "Female Gout",
!GOUT & SEX == "Male" ~ "Male Control",
!GOUT & SEX == "Female" ~ "Female Control"),
levels = c("Male Gout",
"Female Gout",
"Male Control",
"Female Control")),
COHORT_GOUT = factor(case_when(GOUT & Pheno.Study == "UK Biobank" ~ "UK Biobank - Gout",
!GOUT & Pheno.Study == "UK Biobank" ~ "UK Biobank - Control",
GOUT & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") & Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease") ~ "Aus/NZ - Gout",
!GOUT & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") & Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease") ~ "Aus/NZ - Control",
GOUT & Pheno.Study == "EuroGout" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "GlobalGout - Gout",
!GOUT & Pheno.Study == "EuroGout" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "GlobalGout - Control",
Pheno.Study == "Ardea: 401" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "Ardea - LASSO",
Pheno.Study == "Ardea: CLEAR1" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "Ardea - CLEAR1",
Pheno.Study == "Ardea: CLEAR2" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "Ardea - CLEAR2",
Pheno.Study == "Ardea: CRYSTAL" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "Ardea - CRYSTAL",
Pheno.Study == "Ardea: LIGHT" & Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian") ~ "Ardea - LIGHT",
GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study != "Ngati Porou" ~ "East Polynesian - Gout",
!GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study != "Ngati Porou" ~ "East Polynesian - Control",
GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study == "Ngati Porou" ~ "East Polynesian - Gout - NP",
!GOUT & Geno.SpecificAncestry %in% c("East Polynesian") & Pheno.Study == "Ngati Porou" ~ "East Polynesian - Control - NP",
GOUT & Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan") ~ "West Polynesian - Gout",
!GOUT & Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan") ~ "West Polynesian - Control"),
levels = c("UK Biobank - Gout",
"UK Biobank - Control",
"Aus/NZ - Gout",
"Aus/NZ - Control",
"GlobalGout - Gout",
"GlobalGout - Control",
"Ardea - LASSO",
"Ardea - CLEAR1",
"Ardea - CLEAR2",
"Ardea - CRYSTAL",
"Ardea - LIGHT",
"East Polynesian - Gout",
"East Polynesian - Control",
"East Polynesian - Gout - NP",
"East Polynesian - Control - NP",
"West Polynesian - Gout",
"West Polynesian - Control"))) %>%
filter(!is.na(COHORT_GOUT),
!is.na(AGECOL),
!is.na(PRS),
(GOUT & !(is.na(AGE1ATK) & is.na(NUMATK) & is.na(TOPHIGOUT)) | Pheno.Study == "UK Biobank" | !GOUT))
# Making list with each of the cohort subsets. This will make it easier to manipulate the data for models.
data_list <- list("UK Biobank - Gout - Male" = all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Pheno.Study == "UK Biobank"),
"UK Biobank - Gout - Female" = all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Pheno.Study == "UK Biobank"),
"UK Biobank - Control - Male" = all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Pheno.Study == "UK Biobank"),
"UK Biobank - Control - Female" = all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Pheno.Study == "UK Biobank"),
"Aus/NZ European - Gout - Male" = all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian"),
Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease")),
"Aus/NZ European - Gout - Female" = all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian"),
Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease")),
"Aus/NZ European - Control - Male" = all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian"),
Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease")),
"Aus/NZ European - Control - Female" = all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian"),
Pheno.Study %in% c("AGRIA", "Diabetes Mellitus", "Gout in Aotearoa", "LPA", "Ngati Porou", "Renal Disease")),
"GlobalGout - Gout - Male" = all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Pheno.Study == "EuroGout",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"GlobalGout - Gout - Female" = all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Pheno.Study == "EuroGout",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"GlobalGout - Control - Male" = all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Pheno.Study == "EuroGout",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"GlobalGout - Control - Female" = all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Pheno.Study == "EuroGout",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - LASSO - Male" = all_pheno_prs %>% filter(SEX == "Male",
Pheno.Study == "Ardea: 401",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - LASSO - Female" = all_pheno_prs %>% filter(SEX == "Female",
Pheno.Study == "Ardea: 401",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - CLEAR1 - Male" = all_pheno_prs %>% filter(SEX == "Male",
Pheno.Study == "Ardea: CLEAR1",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - CLEAR1 - Female" = all_pheno_prs %>% filter(SEX == "Female",
Pheno.Study == "Ardea: CLEAR1",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - CLEAR2 - Male" = all_pheno_prs %>% filter(SEX == "Male",
Pheno.Study == "Ardea: CLEAR2",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - CLEAR2 - Female" = all_pheno_prs %>% filter(SEX == "Female",
Pheno.Study == "Ardea: CLEAR2",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - CRYSTAL - Male" = all_pheno_prs %>% filter(SEX == "Male",
Pheno.Study == "Ardea: CRYSTAL",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - CRYSTAL - Female" = all_pheno_prs %>% filter(SEX == "Female",
Pheno.Study == "Ardea: CRYSTAL",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - LIGHT - Male" = all_pheno_prs %>% filter(SEX == "Male",
Pheno.Study == "Ardea: LIGHT",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"Ardea - LIGHT - Female" = all_pheno_prs %>% filter(SEX == "Female",
Pheno.Study == "Ardea: LIGHT",
Geno.SpecificAncestry %in% c("European", "European; Iberian", "Iberian")),
"East Polynesian - Gout - Male" = all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study != "Ngati Porou"),
"East Polynesian - Gout - Female" = all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study != "Ngati Porou"),
"East Polynesian - Control - Male" = all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study != "Ngati Porou"),
"East Polynesian - Control - Female" = all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study != "Ngati Porou"),
"East Polynesian - Gout - NP - Male" = all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study == "Ngati Porou"),
"East Polynesian - Gout - NP - Female" = all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study == "Ngati Porou"),
"East Polynesian - Control - NP - Male" = all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study == "Ngati Porou"),
"East Polynesian - Control - NP - Female" = all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Geno.SpecificAncestry %in% c("East Polynesian"),
Pheno.Study == "Ngati Porou"),
"West Polynesian - Gout - Male" = all_pheno_prs %>% filter(SEX == "Male",
GOUT,
Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan")),
"West Polynesian - Gout - Female" = all_pheno_prs %>% filter(SEX == "Female",
GOUT,
Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan")),
"West Polynesian - Control - Male" = all_pheno_prs %>% filter(SEX == "Male",
!GOUT,
Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan")),
"West Polynesian - Control - Female" = all_pheno_prs %>% filter(SEX == "Female",
!GOUT,
Geno.SpecificAncestry %in% c("West Polynesian", "Niuean", "Pukapukan")))
cohortstring <- c("UK Biobank - Gout - Male",
"UK Biobank - Gout - Female",
"UK Biobank - Control - Male",
"UK Biobank - Control - Female",
"Aus/NZ European - Gout - Male",
"Aus/NZ European - Gout - Female",
"Aus/NZ European - Control - Male",
"Aus/NZ European - Control - Female",
"GlobalGout - Gout - Male",
"GlobalGout - Gout - Female",
"GlobalGout - Control - Male",
"GlobalGout - Control - Female",
"Ardea - LASSO - Male",
"Ardea - LASSO - Female",
"Ardea - CLEAR1 - Male",
"Ardea - CLEAR1 - Female",
"Ardea - CLEAR2 - Male",
"Ardea - CLEAR2 - Female",
"Ardea - CRYSTAL - Male",
"Ardea - CRYSTAL - Female",
"Ardea - LIGHT - Male",
"Ardea - LIGHT - Female",
"East Polynesian - Gout - Male",
"East Polynesian - Gout - Female",
"East Polynesian - Control - Male",
"East Polynesian - Control - Female",
"East Polynesian - Gout - NP - Male",
"East Polynesian - Gout - NP - Female",
"East Polynesian - Control - NP - Male",
"East Polynesian - Control - NP - Female",
"West Polynesian - Gout - Male",
"West Polynesian - Gout - Female",
"West Polynesian - Control - Male",
"West Polynesian - Control - Female")
clean_names <- c("UK Biobank<br>Gout<br>Male",
"UK Biobank<br>Gout<br>Female",
"UK Biobank<br>Control<br>Male",
"UK Biobank<br>Control<br>Female",
"Aus/NZ European<br>Gout<br>Male",
"Aus/NZ European<br>Gout<br>Female",
"Aus/NZ European<br>Control<br>Male",
"Aus/NZ European<br>Control<br>Female",
"GlobalGout<br>Gout<br>Male",
"GlobalGout<br>Gout<br>Female",
"GlobalGout<br>Control<br>Male",
"GlobalGout<br>Control<br>Female",
"Ardea<br>LASSO<br>Gout<br>Male",
"Ardea<br>LASSO<br>Gout<br>Female",
"Ardea<br>CLEAR1<br>Gout<br>Male",
"Ardea<br>CLEAR1<br>Gout<br>Female",
"Ardea<br>CLEAR2<br>Gout<br>Male",
"Ardea<br>CLEAR2<br>Gout<br>Female",
"Ardea<br>CRYSTAL<br>Gout<br>Male",
"Ardea<br>CRYSTAL<br>Gout<br>Female",
"Ardea<br>LIGHT<br>Gout<br>Male",
"Ardea<br>LIGHT<br>Gout<br>Female",
"East Polynesian<br>Gout<br>Male",
"East Polynesian<br>Gout<br>Female",
"East Polynesian<br>Control<br>Male",
"East Polynesian<br>Control<br>Female",
"East Polynesian<br>Gout<br>Male<br>NP",
"East Polynesian<br>Gout<br>Female<br>NP",
"East Polynesian<br>Control<br>Male<br>NP",
"East Polynesian<br>Control<br>Female<br>NP",
"West Polynesian<br>Gout<br>Male",
"West Polynesian<br>Gout<br>Female",
"West Polynesian<br>Control<br>Male",
"West Polynesian<br>Control<br>Female")
# Loading GWAS summary tables.
load(here("Output/UKBB_Gene_OR.RData"))
load(here("Output/Tin_Gene_OR.RData"))
# Loading model results.
load(here("Output/GoutModels.RData"))
load(here("Output/OnsetModels.RData"))
load(here("Output/TophiModels.RData"))
# Defining functions for use throughout this section.
# This function will take any numeric variable and produce a report of the form "<mean> ± <sd>".
report <- function(x) {
if(sum(is.na(x)) != length(x)) {
paste0(sprintf(mean(x, na.rm = TRUE), fmt = "%#.1f"), " ± ", sprintf(sd(x, na.rm = TRUE), fmt = "%#.1f"))
} else {
paste0("NA")
}
}
# This function will take any numeric variable and produce a report of the form "<median> (<lower quartile> - <upper quartile>)".
report_median <- function(x) {
if(sum(is.na(x)) != length(x)) {
paste0(median(x, na.rm =T), " (", summary(x)[[2]], " - ", summary(x)[[5]], ")")
} else {
paste0("NA")
}
}
# This function will take any TRUE/FALSE variable and produce a report of the form "<N TRUE> (<% TRUE>)".
sumreport <- function(x) {
if(sum(is.na(x)) != length(x)){
paste0(sum(x, na.rm = TRUE), " (", sprintf((mean(x, na.rm = TRUE) * 100), fmt = "%#.1f"), ")")
} else {
paste0("NA")
}
}
# This function transposes a dataframe.
transpose_df <- function(df) {
t_df <- data.table::transpose(df)
colnames(t_df) <- rownames(df)
rownames(t_df) <- colnames(df)
t_df <- t_df %>%
rownames_to_column() %>%
as_tibble() %>%
row_to_names(row_number = 1)
return(t_df)
}
# This function will take any variable and report the missingness as either "All", "None", or the form "<N missing> (<% missing>)".
missing <- function(x){
if(sum(is.na(x)) == length(x)) {
return("All")
} else if(sum(!is.na(x)) == length(x)){
return("None")
} else {
paste0(format(sum(is.na(x)), big.mark = ","), " (", format(round((sum(is.na(x)) / length(x) * 100), digits = 1), nsmall = 1), ")")
}
}
Main figures and tables
The following forest plots are for the effect of the various PRS or individual variants on either age at onset or tophaceous disease, with tophaceous disease models additionally adjusted for duration. The individual variant models require P < 0.05/2 for significance. Where appropriate, these were tested via meta-analysis to reflect the results from the full PRS models.
Initially, I plotted the effect of the PRS, PRS less ABCG2 and the urate PRS on age at onset.
# Preparing age at onset vs PRS model data for meta-analysis.
tmp <- OnsetModels %>%
filter(Predictor == "PRS") %>%
mutate(Sex = case_when(str_detect(Cohort, "Male") ~ "Male",
str_detect(Cohort, "Female") ~ "Female"),
Cohort = str_remove(Cohort, " - Male| - Female"),
Beta = case_when(is.na(LCL) ~ NA_real_,
TRUE ~ Beta),
Label = case_when(!str_detect(Cohort, "Polynesian") & Sex == "Male" ~ "European Male N",
!str_detect(Cohort, "Polynesian") & Sex == "Female" ~ "European Female",
str_detect(Cohort, "East Polynesian") & Sex == "Male" ~ "East Polynesian Male",
str_detect(Cohort, "East Polynesian") & Sex == "Female" ~ "East Polynesian Female",
str_detect(Cohort, "West Polynesian") & Sex == "Male" ~ "West Polynesian Male",
str_detect(Cohort, "West Polynesian") & Sex == "Female" ~ "West Polynesian Female"),
N = format(N, big.mark = ","))
# Running meta-analysis.
onset <- metagen(TE = Beta,
seTE = SE,
studlab = Cohort,
subgroup = Label,
data = tmp)
# Preparing to produce a tiff file of the following plot.
tiff(file = here("Output/Plots/OnsetvsPRS.tiff"), units = "in", width = 7.0, height = 8.25, res = 300)
# Producing the forest plot for output in the tiff.
forest(onset,
xlim = c(-15, 5),
random = F,
print.tau2 = F,
print.I2 = T,
col.square.lines = "darkgray",
col.diamond.lines = "darkgray",
col.diamond = "#1e6b52",
col.square = "#aa9767",
col.study = "gray60",
col.by = "black",
leftcols = c("studlab", "N"),
rightcols = c("effect", "ci"),
leftlabs = c("Effect of Gout PRS on Age at Gout Onset\nIncluding ABCG2\n", ""),
rightlabs = c("Beta (years)", "[95% CI]"),
addrows.below.overall = 2,
overall = F,
print.subgroup.name = F,
test.subgroup = F)
# Saving the tiff file.
dev.off()
# Preparing age at onset vs PRS less ABCG2 model data for meta-analysis.
tmp <- OnsetModels %>%
filter(Predictor == "PRS_noABCG2") %>%
mutate(Sex = case_when(str_detect(Cohort, "Male") ~ "Male",
str_detect(Cohort, "Female") ~ "Female"),
Cohort = str_remove(Cohort, " - Male| - Female"),
Beta = case_when(is.na(LCL) ~ NA_real_,
TRUE ~ Beta),
Label = case_when(!str_detect(Cohort, "Polynesian") & Sex == "Male" ~ "European Male N",
!str_detect(Cohort, "Polynesian") & Sex == "Female" ~ "European Female",
str_detect(Cohort, "East Polynesian") & Sex == "Male" ~ "East Polynesian Male",
str_detect(Cohort, "East Polynesian") & Sex == "Female" ~ "East Polynesian Female",
str_detect(Cohort, "West Polynesian") & Sex == "Male" ~ "West Polynesian Male",
str_detect(Cohort, "West Polynesian") & Sex == "Female" ~ "West Polynesian Female"),
N = format(N, big.mark = ","))
# Running meta-analysis.
onset <- metagen(TE = Beta,
seTE = SE,
studlab = Cohort,
subgroup = Label,
data = tmp)
# Preparing to produce a tiff file of the following plot.
tiff(file = here("Output/Plots/OnsetvsPRS2.tiff"), units = "in", width = 7.0, height = 8.25, res = 300)
# Producing the forest plot for output in the tiff.
forest(onset,
xlim = c(-15, 5),
random = F,
print.tau2 = F,
print.I2 = T,
col.square.lines = "darkgray",
col.diamond.lines = "darkgray",
col.diamond = "#1e6b52",
col.square = "#aa9767",
col.study = "gray60",
col.by = "black",
leftcols = c("studlab", "N"),
rightcols = c("effect", "ci"),
leftlabs = c("Effect of Gout PRS on Age at Gout Onset\nExcluding ABCG2\n", ""),
rightlabs = c("Beta (years)", "[95% CI]"),
addrows.below.overall = 2,
overall = F,
print.subgroup.name = F,
test.subgroup = F)
# Saving the tiff file.
dev.off()
Plotting the effect of the individual ABCG2 variants on age at onset.
# Preparing age at onset vs ABCG2 variants model data for meta-analysis.
tmp <- OnsetModels %>%
filter(str_detect(Predictor, "rs")) %>%
mutate(Sex = case_when(str_detect(Cohort, "Male") ~ "Male",
str_detect(Cohort, "Female") ~ "Female"),
Cohort = str_remove(Cohort, " - Male| - Female"),
Beta = case_when(is.na(LCL) ~ NA_real_,
TRUE ~ Beta),
Label = case_when(!str_detect(Cohort, "Polynesian") & Sex == "Male" ~ "European Male N",
!str_detect(Cohort, "Polynesian") & Sex == "Female" ~ "European Female",
str_detect(Cohort, "East Polynesian") & Sex == "Male" ~ "East Polynesian Male",
str_detect(Cohort, "East Polynesian") & Sex == "Female" ~ "East Polynesian Female",
str_detect(Cohort, "West Polynesian") & Sex == "Male" ~ "West Polynesian Male",
str_detect(Cohort, "West Polynesian") & Sex == "Female" ~ "West Polynesian Female"),
N = format(N, big.mark = ","))
# Making list of unique predictors (i.e. both SNPs)
snps <- unique(as.character(tmp$Predictor))
# For each SNP.
for(i in snps){
# Filtering the tmp file to only include that SNP.
tmp2 <- tmp %>%
filter(Predictor == i)
# Running the meta-analysis for that SNP.
assign(paste0(i), metagen(TE = Beta,
seTE = SE,
studlab = Cohort,
subgroup = Label,
data = tmp2))
}
# Preparing to produce tiff file for the first SNP.
tiff(file = here(paste0("Output/Plots/Onsetvs", snps[1], ".tiff")), units = "in", width = 7.0, height = 8.25, res = 300)
# Producing the forest plot for output in the tiff.
forest(get(snps[1]),
xlim = c(-15, 5),
random = F,
print.tau2 = F,
print.I2 = T,
col.square.lines = "darkgray",
col.diamond.lines = "darkgray",
col.diamond = "#1e6b52",
col.square = "#aa9767",
col.study = "gray60",
col.by = "black",
leftcols = c("studlab", "N"),
rightcols = c("effect", "ci"),
leftlabs = c(paste0("Change in Age at Gout Onset\nper ", snps[1], " allele"), ""),
rightlabs = c("Beta (years)", "[95% CI]"),
addrows.below.overall = 2,
overall = F,
print.subgroup.name = F,
test.subgroup = F)
# Saving the tiff file.
dev.off()
# Preparing to produce tiff file for the second SNP.
tiff(file = here(paste0("Output/Plots/Onsetvs", snps[2], ".tiff")), units = "in", width = 7.0, height = 8.25, res = 300)
# Producing the forest plot for output in the tiff.
forest(get(paste0(snps[2])),
xlim = c(-15, 5),
random = F,
print.tau2 = F,
print.I2 = T,
col.square.lines = "darkgray",
col.diamond.lines = "darkgray",
col.diamond = "#1e6b52",
col.square = "#aa9767",
col.study = "gray60",
col.by = "black",
leftcols = c("studlab", "N"),
rightcols = c("effect", "ci"),
leftlabs = c(paste0("Change in Age at Gout Onset\nper ", snps[2], " allele"), ""),
rightlabs = c("Beta (years)", "[95% CI]"),
addrows.below.overall = 2,
overall = F,
print.subgroup.name = F,
test.subgroup = F)
# Saving the tiff file.
dev.off()
Plotting the effect of the PRS, PRS less ABCG2 and the urate PRS on tophaceous gout.
# Preparing tophaceous gout vs gout PRS (unadjusted) model data for meta-analysis.
tmp <- TophiModels %>%
filter(Predictor == "PRS",
!str_detect(Covariates, "DURATION"),
`N case` > 20,
`N control` > 20) %>%
mutate(Sex = case_when(str_detect(Cohort, "Male") ~ "Male",
str_detect(Cohort, "Female") ~ "Female"),
Cohort = str_remove(Cohort, " - Male| - Female"),
`log-odds` = case_when(LCL == 0 ~ NA_real_,
TRUE ~ `log-odds`),
Label = case_when(!str_detect(Cohort, "Polynesian") & Sex == "Male" ~ "European Male",
!str_detect(Cohort, "Polynesian") & Sex == "Female" ~ "European Female",
str_detect(Cohort, "East Polynesian") & Sex == "Male" ~ "East Polynesian Male",
str_detect(Cohort, "East Polynesian") & Sex == "Female" ~ "East Polynesian Female",
str_detect(Cohort, "West Polynesian") & Sex == "Male" ~ "West Polynesian Male",
str_detect(Cohort, "West Polynesian") & Sex == "Female" ~ "West Polynesian Female"))
# Running meta-analysis.
tophi <- metagen(TE = `log-odds`,
seTE = SE,
studlab = Cohort,
subgroup = Label,
data = tmp,
sm = "OR")
# Preparing to produce tiff file.
tiff(file = here("Output/Plots/TophivsPRS.tiff"), units = "in", width = 6.75, height = 6, res = 300)
# Producing the forest plot for output in the tiff.
forest(tophi,
xlim = c(0.2, 5),
random = F,
print.tau2 = F,
print.I2 = T,
col.square.lines = "darkgray",
col.diamond.lines = "darkgray",
col.diamond = "#1e6b52",
col.square = "#aa9767",
col.study = "gray60",
col.by = "black",
leftcols = c("studlab", "N case", "N control"),
rightcols = c("effect", "ci"),
leftlabs = c("Effect of Gout PRS on\nTophaceous Gout", "N\nwith", "N\nwithout"),
rightlabs = c("OR", "[95% CI]"),
addrows.below.overall = 2,
overall = F,
print.subgroup.name = F,
smlab = "Unadjusted",
test.subgroup = F)
# Saving the tiff file.
dev.off()
# Preparing tophaceous gout vs gout PRS (adjusted) model data for meta-analysis.
tmp <- TophiModels %>%
filter(Predictor == "PRS",
str_detect(Covariates, "DURATION"),
`N case` > 20,
`N control` > 20) %>%
mutate(Sex = case_when(str_detect(Cohort, "Male") ~ "Male",
str_detect(Cohort, "Female") ~ "Female"),
Cohort = str_remove(Cohort, " - Male| - Female"),
`log-odds` = case_when(LCL == 0 ~ NA_real_,
TRUE ~ `log-odds`),
Label = case_when(!str_detect(Cohort, "Polynesian") & Sex == "Male" ~ "European Male",
!str_detect(Cohort, "Polynesian") & Sex == "Female" ~ "European Female",
str_detect(Cohort, "East Polynesian") & Sex == "Male" ~ "East Polynesian Male",
str_detect(Cohort, "East Polynesian") & Sex == "Female" ~ "East Polynesian Female",
str_detect(Cohort, "West Polynesian") & Sex == "Male" ~ "West Polynesian Male",
str_detect(Cohort, "West Polynesian") & Sex == "Female" ~ "West Polynesian Female"))
# Running meta-analysis.
tophi_adj <- metagen(TE = `log-odds`,
seTE = SE,
studlab = Cohort,
subgroup = Label,
data = tmp,
sm = "OR")
# Preparing to produce tiff file.
tiff(file = here("Output/Plots/TophivsPRS_adj.tiff"), units = "in", width = 6.75, height = 6, res = 300)
# Producing the forest plot for output in the tiff.
forest(tophi_adj,
xlim = c(0.2, 5),
random = F,
print.tau2 = F,
print.I2 = T,
col.square.lines = "darkgray",
col.diamond.lines = "darkgray",
col.diamond = "#1e6b52",
col.square = "#aa9767",
col.study = "gray60",
col.by = "black",
leftcols = c("studlab", "N case", "N control"),
rightcols = c("effect", "ci"),
leftlabs = c("Effect of Gout PRS on\nTophaceous Gout", "N\nwith", "N\nwithout"),
rightlabs = c("OR", "[95% CI]"),
addrows.below.overall = 2,
overall = F,
print.subgroup.name = F,
smlab = "Duration Adjusted",
test.subgroup = F)
# Saving the tiff file.
dev.off()
# Preparing tophaceous gout vs gout PRS less ABCG2 (unadjusted) model data for meta-analysis.
tmp <- TophiModels %>%
filter(Predictor == "PRS_noABCG2",
!str_detect(Covariates, "DURATION"),
`N case` > 20,
`N control` > 20) %>%
mutate(Sex = case_when(str_detect(Cohort, "Male") ~ "Male",
str_detect(Cohort, "Female") ~ "Female"),
Cohort = str_remove(Cohort, " - Male| - Female"),
`log-odds` = case_when(LCL == 0 ~ NA_real_,
TRUE ~ `log-odds`),
Label = case_when(!str_detect(Cohort, "Polynesian") & Sex == "Male" ~ "European Male",
!str_detect(Cohort, "Polynesian") & Sex == "Female" ~ "European Female",
str_detect(Cohort, "East Polynesian") & Sex == "Male" ~ "East Polynesian Male",
str_detect(Cohort, "East Polynesian") & Sex == "Female" ~ "East Polynesian Female",
str_detect(Cohort, "West Polynesian") & Sex == "Male" ~ "West Polynesian Male",
str_detect(Cohort, "West Polynesian") & Sex == "Female" ~ "West Polynesian Female"))
# Running meta-analysis.
tophi <- metagen(TE = `log-odds`,
seTE = SE,
studlab = Cohort,
subgroup = Label,
data = tmp,
sm = "OR")
# Preparing to produce tiff file.
tiff(file = here("Output/Plots/TophivsPRS2.tiff"), units = "in", width = 6.75, height = 6, res = 300)
# Producing the forest plot for output in the tiff.
forest(tophi,
xlim = c(0.2, 5),
random = F,
print.tau2 = F,
print.I2 = T,
col.square.lines = "darkgray",
col.diamond.lines = "darkgray",
col.diamond = "#1e6b52",
col.square = "#aa9767",
col.study = "gray60",
col.by = "black",
leftcols = c("studlab", "N case", "N control"),
rightcols = c("effect", "ci"),
leftlabs = c("Effect of Gout PRS on\nTophaceous Gout", "N\nwith", "N\nwithout"),
rightlabs = c("OR", "[95% CI]"),
addrows.below.overall = 2,
overall = F,
print.subgroup.name = F,
smlab = "Unadjusted\nExcluding ABCG2",
test.subgroup = F)
# Saving the tiff file.
dev.off()
Plotting the effect of the ABCG2 variants on tophaceous gout.
# Preparing tophaceous gout vs rs2231142 (unadjusted) model data for meta-analysis.
tmp <- TophiModels %>%
filter(Predictor == "rs2231142",
!str_detect(Covariates, "DURATION"),
`N case` > 20,
`N control` > 20) %>%
mutate(Sex = case_when(str_detect(Cohort, "Male") ~ "Male",
str_detect(Cohort, "Female") ~ "Female"),
Cohort = str_remove(Cohort, " - Male| - Female"),
`log-odds` = case_when(LCL == 0 ~ NA_real_,
TRUE ~ `log-odds`),
Label = case_when(!str_detect(Cohort, "Polynesian") & Sex == "Male" ~ "European Male",
!str_detect(Cohort, "Polynesian") & Sex == "Female" ~ "European Female",
str_detect(Cohort, "East Polynesian") & Sex == "Male" ~ "East Polynesian Male",
str_detect(Cohort, "East Polynesian") & Sex == "Female" ~ "East Polynesian Female",
str_detect(Cohort, "West Polynesian") & Sex == "Male" ~ "West Polynesian Male",
str_detect(Cohort, "West Polynesian") & Sex == "Female" ~ "West Polynesian Female"))
# Running meta-analysis.
tophi <- metagen(TE = `log-odds`,
seTE = SE,
studlab = Cohort,
subgroup = Label,
data = tmp,
sm = "OR")
# Preparing to produce tiff file.
tiff(file = here("Output/Plots/Tophivsrs2231142.tiff"), units = "in", width = 6.75, height = 6, res = 300)
# Producing the forest plot for output in the tiff.
forest(tophi,
xlim = c(0.2, 5),
random = F,
print.tau2 = F,
print.I2 = T,
col.square.lines = "darkgray",
col.diamond.lines = "darkgray",
col.diamond = "#1e6b52",
col.square = "#aa9767",
col.study = "gray60",
col.by = "black",
leftcols = c("studlab", "N case", "N control"),
rightcols = c("effect", "ci"),
leftlabs = c("Effect of rs2231142 on\nTophaceous Gout", "N\nwith", "N\nwithout"),
rightlabs = c("OR", "[95% CI]"),
addrows.below.overall = 2,
overall = F,
print.subgroup.name = F,
smlab = "",
test.subgroup = F)
# Saving the tiff file.
dev.off()
# Preparing tophaceous gout vs rs10011796 (unadjusted) model data for meta-analysis.
tmp <- TophiModels %>%
filter(Predictor == "rs10011796",
!str_detect(Covariates, "DURATION"),
`N case` > 20,
`N control` > 20) %>%
mutate(Sex = case_when(str_detect(Cohort, "Male") ~ "Male",
str_detect(Cohort, "Female") ~ "Female"),
Cohort = str_remove(Cohort, " - Male| - Female"),
`log-odds` = case_when(LCL == 0 ~ NA_real_,
TRUE ~ `log-odds`),
Label = case_when(!str_detect(Cohort, "Polynesian") & Sex == "Male" ~ "European Male",
!str_detect(Cohort, "Polynesian") & Sex == "Female" ~ "European Female",
str_detect(Cohort, "East Polynesian") & Sex == "Male" ~ "East Polynesian Male",
str_detect(Cohort, "East Polynesian") & Sex == "Female" ~ "East Polynesian Female",
str_detect(Cohort, "West Polynesian") & Sex == "Male" ~ "West Polynesian Male",
str_detect(Cohort, "West Polynesian") & Sex == "Female" ~ "West Polynesian Female"))
# Running meta-analysis.
tophi <- metagen(TE = `log-odds`,
seTE = SE,
studlab = Cohort,
subgroup = Label,
data = tmp,
sm = "OR")
# Preparing to produce tiff file.
tiff(file = here("Output/Plots/Tophivsrs10011796.tiff"), units = "in", width = 6.75, height = 6, res = 300)
# Producing the forest plot for output in the tiff.
forest(tophi,
xlim = c(0.2, 5),
random = F,
print.tau2 = F,
print.I2 = T,
col.square.lines = "darkgray",
col.diamond.lines = "darkgray",
col.diamond = "#1e6b52",
col.square = "#aa9767",
col.study = "gray60",
col.by = "black",
leftcols = c("studlab", "N case", "N control"),
rightcols = c("effect", "ci"),
leftlabs = c("Effect of rs10011796 on\nTophaceous Gout", "N\nwith", "N\nwithout"),
rightlabs = c("OR", "[95% CI]"),
addrows.below.overall = 2,
overall = F,
print.subgroup.name = F,
smlab = "",
test.subgroup = F)
# Saving the tiff file.
dev.off()
The following table describes the cohort statistics for every variable that I have deemed to have sufficient non-missing data. If it is missing at more than 50% in a single cohort then that cohort will be set to “too much missing”. If there are no cohorts with fewer than 30% missing (excluding UKBB) then the variable is removed from the plot.
# Preparing data_list for table of demographics.
data_list2 <- data_list
# Removing UK Biobank and control cohorts.
for(i in length(data_list2):1){
if(str_detect(names(data_list2)[[i]], "UK Biobank|Control")){
data_list2[[i]] <- NULL
}
}
# Removing cohorts with fewer than 20 individuals.
for(i in length(data_list2):1){
if(nrow(data_list2[[i]]) < 20){
data_list2[[i]] <- NULL
}
}
# Producing table summarizing the cohorts of interest for the variables of interest.
table1 <- tibble("Cohort" = cohortstring[which(cohortstring %in% names(data_list2))],
"N" = unlist(lapply(data_list2, function(x) format(nrow(x), big.mark = ","))),
"Age at\nrecruitment,\nyears" = unlist(lapply(data_list2, function(x) report(x$AGECOL))),
"Age at onset,\nyears" = unlist(lapply(data_list2, function(x) report(x$AGE1ATK))),
"Disease\nduration,\nyears" = unlist(lapply(data_list2, function(x) report(x$DURATION))),
"Flares in\nprevious\nyear" = unlist(lapply(data_list2, function(x) report_median(x$NUMATK))),
"Tophaceous\ndisease" = unlist(lapply(data_list2, function(x) sumreport(x$TOPHIGOUT))),
"Gout PRS" = unlist(lapply(data_list2, function(x) report(x$PRS))),
"Urate PRS" = unlist(lapply(data_list2, function(x) report(x$Urate_PRS))))
# Transposing the table and preparing it for printing.
table1 <- transpose_df(table1) %>%
column_to_rownames(var = "Cohort") %>%
mutate(across(.cols = 1:ncol(table1), ~ str_replace_all(string = .x, pattern = " ", replacement = " ")))
# Further preparing the table for printing.
row.names(table1) <- str_replace_all(row.names(table1), " ", " ")
# Printing the table.
table1 %>%
kable(col.names = clean_names[which(cohortstring %in% names(data_list2))],
align = "c",
escape = F) %>%
kable_styling("striped") %>%
scroll_box(width = "900px", height = "475px") %>%
footnote("Flares reported as median (inter-quartile range). All other numeric variables reported as mean ± sd. All binary variables reported as N (%).")
|
|
Aus/NZ European Gout Male
|
Aus/NZ European Gout Female
|
GlobalGout Gout Male
|
GlobalGout Gout Female
|
Ardea LASSO Gout Male
|
Ardea LASSO Gout Female
|
Ardea CLEAR1 Gout Male
|
Ardea CLEAR2 Gout Male
|
Ardea CRYSTAL Gout Male
|
Ardea LIGHT Gout Male
|
East Polynesian Gout Male
|
East Polynesian Gout Female
|
East Polynesian Gout Male NP
|
East Polynesian Gout Female NP
|
West Polynesian Gout Male
|
West Polynesian Gout Female
|
|
N
|
978
|
210
|
1,032
|
124
|
819
|
65
|
230
|
239
|
175
|
109
|
408
|
122
|
124
|
28
|
436
|
54
|
|
Age at recruitment, years
|
62.4 ± 12.4
|
70.0 ± 12.7
|
60.1 ± 13.1
|
67.6 ± 11.1
|
51.4 ± 11.8
|
60.7 ± 10.6
|
52.3 ± 11.1
|
53.0 ± 10.8
|
53.9 ± 11.0
|
53.3 ± 11.8
|
54.3 ± 12.4
|
60.7 ± 11.7
|
59.7 ± 11.3
|
59.1 ± 13.3
|
47.5 ± 12.3
|
53.4 ± 13.4
|
|
Age at onset, years
|
46.4 ± 15.8
|
59.5 ± 15.7
|
46.5 ± 14.0
|
57.8 ± 12.5
|
41.4 ± 13.4
|
55.1 ± 12.0
|
41.9 ± 12.4
|
42.6 ± 13.2
|
40.1 ± 13.0
|
42.4 ± 13.1
|
37.9 ± 14.0
|
49.4 ± 15.4
|
39.1 ± 15.2
|
46.0 ± 16.8
|
34.6 ± 12.0
|
44.3 ± 15.0
|
|
Disease duration, years
|
16.8 ± 12.7
|
10.9 ± 10.4
|
14.5 ± 11.4
|
10.6 ± 9.8
|
11.0 ± 9.4
|
6.6 ± 8.0
|
11.4 ± 9.4
|
11.4 ± 9.8
|
14.8 ± 10.0
|
11.9 ± 8.7
|
17.2 ± 12.8
|
13.1 ± 13.2
|
21.7 ± 15.3
|
14.1 ± 12.6
|
13.6 ± 10.3
|
9.2 ± 9.2
|
|
Flares in previous year
|
2 (0 - 4)
|
1.5 (0 - 3.25)
|
2 (1 - 4)
|
2.5 (1 - 4)
|
4 (3 - 8)
|
3 (3 - 6)
|
3 (2 - 6)
|
4 (2 - 8)
|
4 (3 - 6)
|
4 (2 - 10)
|
3 (1 - 6)
|
2 (0 - 5)
|
2 (0 - 3)
|
3 (1 - 6)
|
4 (2 - 10)
|
2 (1 - 5)
|
|
Tophaceous disease
|
333 (43.4)
|
67 (39.9)
|
320 (57.6)
|
46 (62.2)
|
138 (16.8)
|
5 (7.7)
|
34 (14.9)
|
54 (22.6)
|
174 (99.4)
|
26 (23.9)
|
144 (41.3)
|
26 (28.3)
|
9 (12.2)
|
4 (19.0)
|
177 (44.6)
|
14 (28.6)
|
|
Gout PRS
|
4.1 ± 0.7
|
4.0 ± 0.6
|
4.0 ± 0.6
|
4.0 ± 0.6
|
4.1 ± 0.7
|
4.1 ± 0.6
|
4.2 ± 0.7
|
4.2 ± 0.6
|
4.2 ± 0.6
|
4.1 ± 0.6
|
4.4 ± 0.5
|
4.4 ± 0.5
|
4.2 ± 0.5
|
4.4 ± 0.5
|
4.8 ± 0.6
|
4.7 ± 0.6
|
|
Urate PRS
|
3.8 ± 0.3
|
3.8 ± 0.3
|
3.8 ± 0.3
|
3.7 ± 0.3
|
3.8 ± 0.3
|
3.8 ± 0.3
|
3.8 ± 0.3
|
3.8 ± 0.3
|
3.9 ± 0.3
|
3.8 ± 0.3
|
3.7 ± 0.3
|
3.7 ± 0.3
|
3.7 ± 0.3
|
3.8 ± 0.3
|
3.8 ± 0.3
|
3.9 ± 0.4
|
|
Note:
|
|
Flares reported as median (inter-quartile range). All other numeric variables reported as mean ± sd. All binary variables reported as N (%).
|
Supplementary figures and tables
Manhattan plot for the gout GWAS showing the locations of the SNPs that were used in the PRS.
# Loading in summary statistics file.
load(path(scratch_path, "Output/sumstat_final.RData"))
# Filtering summary stats to keep only SNPs with P < 0.01.
GOUT_pValues <- sumstat_final %>%
filter(P < 0.01) %>%
arrange(CHR, BP)
# Extracting lead SNPs for labeling.
tmp <- GOUT_pValues %>%
filter(RSID %in% UKBB_Gene_OR$RSID) %>%
mutate("Gene" = UKBB_Gene_OR$Locus_Name)
# Extracting all non-lead SNPs for labeling.
tmp2 <- GOUT_pValues %>%
filter(!(SNP %in% tmp$SNP)) %>%
mutate("Gene" = NA)
# Joining back together.
GOUT_pValues <- full_join(tmp, tmp2) %>%
arrange(CHR, BP)
# Preparing data for manhattan plot.
GOUT_pValues2 <- GOUT_pValues %>%
# Grouping by chromosome.
group_by(CHR) %>%
# Computing chromosome size as new column.
summarise(chr_len = max(BP)) %>%
# Calculating cumulative position of each chromosome.
mutate(tot = cumsum(chr_len) - chr_len) %>%
# Removing chromosome length column.
select(-chr_len) %>%
# Adding this info to the initial dataset.
left_join(GOUT_pValues, ., by = "CHR") %>%
# Arranging by chr/bp.
arrange(CHR, BP) %>%
# Adding a cumulative position of each SNP.
mutate(BPcum = BP + tot) %>%
# Adding highlight and annotation information.
mutate(is_highlight = ifelse(RSID %in% UKBB_Gene_OR$RSID, "yes", "no")) %>%
mutate(is_annotate = ifelse(RSID %in% UKBB_Gene_OR$RSID, "yes", "no"))
# Marking center of each chromosome in axisdf object.
axisdf <- GOUT_pValues2 %>%
group_by(CHR) %>%
summarize(center = (max(BPcum) + min(BPcum)) / 2)
# Plotting the manhattan plot, with the cumulative SNP location on the x-axis and -log10(P) on the y-axis.
ggplot(GOUT_pValues2, aes(x = BPcum, y = -log10(P))) +
# Plotting each SNP, colored based on the CHR.
geom_point(aes(color = as.factor(CHR)), alpha = 0.8, size = 1.3) +
# Setting colors of chromosomes.
scale_color_manual(values = rep(c("#1e6b52", "#aa9767"), 22)) +
# Adding significance threshold.
geom_hline(yintercept = -log10(5e-8), colour = "red") +
# Customizing x-axis based on CHR center points.
scale_x_continuous(label = axisdf$CHR, breaks = axisdf$center) +
# Removing space between plot area and x-axis.
scale_y_continuous(expand = c(0, 0), limits = c(0, 250)) +
# Adding highlighted data points.
geom_point(data = subset(GOUT_pValues2, is_highlight == "yes"), color = "orange", size = 2) +
# Adding label using ggrepel to avoid overlapping.
geom_label_repel(aes(label = Gene), size = 3, box.padding = 0.5, force_pull = 0.5, nudge_y = 3, max.overlaps = Inf) +
# Adding x-axis label.
xlab("Chromosome") +
# Adding title to plot.
ggtitle("UK Biobank Gout GWAS Results") +
# Setting theme to bw setting.
theme_bw() +
# Further customizing the theme.
theme(
legend.position = "none",
panel.border = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()
)
# Saving the image (must be done on biocmerriserver2).
ggsave(filename = here("Output/Plots/Manhattan.tiff"), width = 10, height = 5, dpi = 600)
Preparing table of GWAS results for PRS.
# Making pooled West Polynesian table.
WestPoly <- rbind(data_list[[31]], data_list[[32]], data_list[[33]], data_list[[34]])
# Making pooled East Polynesian table.
EastPoly <- rbind(data_list[[23]], data_list[[24]], data_list[[25]], data_list[[26]], data_list[[27]], data_list[[28]], data_list[[29]], data_list[[30]])
# Pulling out list of SNPs.
SNPlist <- UKBB_Gene_OR$RSID
# Making function for extracting the minor allele frequency for a set of SNPs.
get_maf <- function(cohort, snps){
# Making test dataframe which is each SNP as a factorized column.
test <- cohort %>%
select(all_of(snps)) %>%
mutate_all(factor, levels = 0:2) %>%
na.omit()
# Making empty vector.
tmp <- c()
# For each column in test.
for(i in 1:ncol(test)){
# Calculating the allele frequency for the effect allele (1 encoded).
tmp[i] <- ((sum(test[[i]] == 1) + (sum(test[[i]] == 2) * 2)) / (2 * nrow(test))) %>% sprintf(fmt = "%#.3f")
}
# Output the tmp vector.
return(tmp)
}
# Calculating EAF for West Polynesian cohort.
tmp1 <- get_maf(cohort = WestPoly, snps = SNPlist)
# Calculating EAF for East Polynesian cohort.
tmp2 <- get_maf(cohort = EastPoly, snps = SNPlist)
# Preparing lead SNP table for supplementary table format.
tmp <- UKBB_Gene_OR %>%
mutate(`Locus (chr:lower-upper)` = paste0(CHR, ":", BP1, "-", BP2),
`Base Pair Position` = str_remove_all(format(BP, big.mark = ","), " "),
`Odds ratio` = sprintf(OR, fmt = "%#.2f"),
`95%-CI` = paste0("[", sprintf(L95, fmt = "%#.2f"), ", ", sprintf(U95, fmt = "%#.2f"), "]"),
`Odds ratio (Marginal)` = str_replace(sprintf(OR_old, fmt = "%#.2f"), "NA", NA_character_),
`95%-CI (Marginal)` = case_when(!is.na(L95_old) ~ paste0("[", sprintf(L95_old, fmt = "%#.2f"), ", ", sprintf(U95_old, fmt = "%#.2f"), "]")),
`EAF (UK Biobank)` = sprintf(EAF, fmt = "%#.3f"),
`EAF (East Polynesian)` = tmp2,
`EAF (West Polynesian)` = tmp1,
`p-value` = formatC(P, format = "e", digits = 2),
`p-value (Marginal)` = formatC(P_old, format = "e", digits = 2)) %>%
rename(`Lead SNP rsID` = RSID,
`Effect allele` = Effect_Allele,
`Alternate allele` = Alternate_Allele,
`Locus name` = Locus_Name) %>%
select(`Locus (chr:lower-upper)`, `Lead SNP rsID`, `Base Pair Position`, `Effect allele`, `Alternate allele`, `Odds ratio`, `95%-CI`, `p-value`, `Locus name`, `EAF (UK Biobank)`, `EAF (East Polynesian)`, `EAF (West Polynesian)`, `Odds ratio (Marginal)`, `95%-CI (Marginal)`, `p-value (Marginal)`)
# Writing out supplementary table 1.
write_delim(tmp, file = here("Output/Tables/UKBB_Gene_OR.txt"), delim = "\t")
# Preparing lead SNP of Tin Urate PRS table for supplementary table format.
tmp <- Tin_Gene_OR %>%
mutate(`Locus (chr:lower-upper)` = paste0(CHR, ":", BP1, "-", BP2),
`Base Pair Position` = str_remove_all(format(BP, big.mark = ","), " "),
Beta = sprintf(Beta, fmt = "%#.3f"),
`95%-CI` = paste0("[", sprintf(L95, fmt = "%#.3f"), ", ", sprintf(U95, fmt = "%#.3f"), "]"),
`p-value` = formatC(P, format = "e", digits = 2)) %>%
rename(`Lead SNP rsID` = RSID,
`Effect allele` = Effect_Allele,
`Alternate allele` = Alternate_Allele,
`Locus name` = Locus_Name) %>%
select(`Locus (chr:lower-upper)`, `Lead SNP rsID`, `Base Pair Position`, `Effect allele`, `Alternate allele`, Beta, `95%-CI`, `p-value`, `Locus name`)
# Writing out supplementary table 1.
write_delim(tmp, file = here("Output/Tables/Tin_Gene_OR.txt"), delim = "\t")
Producing missing data supplementary table.
# Producing table of missing data statistics.
table1 <- tibble("Cohort" = cohortstring,
"N" = unlist(lapply(data_list, function(x) format(nrow(x), big.mark = ","))),
"Age at Collection" = unlist(lapply(data_list, function(x) missing(x$AGECOL))),
"Serum Urate" = unlist(lapply(data_list, function(x) missing(x$URATE))),
"ULT" = unlist(lapply(data_list, function(x) missing(x$ULT))),
"Age at Onset" = unlist(lapply(data_list, function(x) missing(x$AGE1ATK))),
"Disease Duration" = unlist(lapply(data_list, function(x) missing(x$DURATION))),
"Flares" = unlist(lapply(data_list, function(x) missing(x$NUMATK))),
"Tophi" = unlist(lapply(data_list, function(x) missing(x$TOPHIGOUT))),
"PRS" = unlist(lapply(data_list, function(x) missing(x$PRS))),
"Prophylaxis" = unlist(lapply(data_list, function(x) missing(x$PROPHY))),
"BMI" = unlist(lapply(data_list, function(x) missing(x$BMI))),
"Hypertension" = unlist(lapply(data_list, function(x) missing(x$HYPERTENSION))),
"Type 2 Diabetes" = unlist(lapply(data_list, function(x) missing(x$DIABETES))),
"Heart Disease" = unlist(lapply(data_list, function(x) missing(x$HEART))),
"Kidney Disease" = unlist(lapply(data_list, function(x) missing(x$KIDNEY))),
"Dyslipidemia" = unlist(lapply(data_list, function(x) missing(x$LIPIDS))),
"Stroke" = unlist(lapply(data_list, function(x) missing(x$STROKE))),
"Alcoholic Drinks / Week" = unlist(lapply(data_list, function(x) missing(x$TOTALALC))),
"Sugar-Sweetened Drinks / Week" = unlist(lapply(data_list, function(x) missing(x$SUGDRINK))),
"Current Smoker" = unlist(lapply(data_list, function(x) missing(x$CURSMOKE))),
"Family History of Gout" = unlist(lapply(data_list, function(x) missing(x$FAMGOUT))),
"No. Relatives w/ Gout" = unlist(lapply(data_list, function(x) missing(x$FAMGOUTNUM))))
# Transposing dataframe.
table1 <- transpose_df(table1) %>%
column_to_rownames(var = "Cohort")
# Writing out supplementary table.
write_delim(table1, file = here("Output/Tables/missing.txt"), delim = "\t")
# Preparing for printing in Rmd.
table1 <- table1 %>%
mutate(across(.cols = 1:ncol(table1), ~ str_replace(string = .x, pattern = " ", replacement = " ")))
# Preparing for printing in Rmd.
row.names(table1) <- str_replace(row.names(table1), " ", " ")
# Printing table.
table1 %>%
kable(col.names = clean_names,
align = "c",
escape = F) %>%
kable_styling("striped") %>%
scroll_box(width = "900px", height = "475px") %>%
footnote("'All' = all missing, 'None' = none missing")
|
|
UK Biobank Gout Male
|
UK Biobank Gout Female
|
UK Biobank Control Male
|
UK Biobank Control Female
|
Aus/NZ European Gout Male
|
Aus/NZ European Gout Female
|
Aus/NZ European Control Male
|
Aus/NZ European Control Female
|
GlobalGout Gout Male
|
GlobalGout Gout Female
|
GlobalGout Control Male
|
GlobalGout Control Female
|
Ardea LASSO Gout Male
|
Ardea LASSO Gout Female
|
Ardea CLEAR1 Gout Male
|
Ardea CLEAR1 Gout Female
|
Ardea CLEAR2 Gout Male
|
Ardea CLEAR2 Gout Female
|
Ardea CRYSTAL Gout Male
|
Ardea CRYSTAL Gout Female
|
Ardea LIGHT Gout Male
|
Ardea LIGHT Gout Female
|
East Polynesian Gout Male
|
East Polynesian Gout Female
|
East Polynesian Control Male
|
East Polynesian Control Female
|
East Polynesian Gout Male NP
|
East Polynesian Gout Female NP
|
East Polynesian Control Male NP
|
East Polynesian Control Female NP
|
West Polynesian Gout Male
|
West Polynesian Gout Female
|
West Polynesian Control Male
|
West Polynesian Control Female
|
|
N
|
6,210
|
522
|
143,574
|
162,244
|
978
|
210
|
764
|
619
|
1,032
|
124
|
44
|
79
|
819
|
65
|
230
|
16
|
239
|
7
|
175
|
4
|
109
|
8
|
408
|
122
|
241
|
382
|
124
|
28
|
44
|
37
|
436
|
54
|
199
|
180
|
|
Age at Collection
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
|
Serum Urate
|
287 (4.6)
|
29 (5.6)
|
6,684 (4.7)
|
7,869 (4.9)
|
18 (1.8)
|
4 (1.9)
|
36 (4.7)
|
24 (3.9)
|
50 (4.8)
|
8 (6.5)
|
32 (72.7)
|
71 (89.9)
|
4 (0.5)
|
1 (1.5)
|
1 (0.4)
|
None
|
1 (0.4)
|
None
|
None
|
None
|
None
|
None
|
2 (0.5)
|
1 (0.8)
|
35 (14.5)
|
64 (16.8)
|
None
|
1 (3.6)
|
None
|
None
|
1 (0.2)
|
1 (1.9)
|
21 (10.6)
|
16 (8.9)
|
|
ULT
|
None
|
None
|
All
|
All
|
394 (40.3)
|
97 (46.2)
|
All
|
All
|
279 (27.0)
|
42 (33.9)
|
All
|
All
|
3 (0.4)
|
1 (1.5)
|
None
|
None
|
None
|
None
|
81 (46.3)
|
3 (75.0)
|
None
|
None
|
124 (30.4)
|
32 (26.2)
|
All
|
All
|
21 (16.9)
|
2 (7.1)
|
All
|
All
|
126 (28.9)
|
14 (25.9)
|
All
|
All
|
|
Age at Onset
|
All
|
All
|
All
|
All
|
51 (5.2)
|
24 (11.4)
|
All
|
All
|
25 (2.4)
|
6 (4.8)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
27 (6.6)
|
9 (7.4)
|
All
|
All
|
1 (0.8)
|
None
|
All
|
All
|
18 (4.1)
|
9 (16.7)
|
All
|
All
|
|
Disease Duration
|
All
|
All
|
All
|
All
|
51 (5.2)
|
24 (11.4)
|
All
|
All
|
25 (2.4)
|
6 (4.8)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
27 (6.6)
|
9 (7.4)
|
All
|
All
|
1 (0.8)
|
None
|
All
|
All
|
18 (4.1)
|
9 (16.7)
|
All
|
All
|
|
Flares
|
All
|
All
|
All
|
All
|
119 (12.2)
|
34 (16.2)
|
All
|
All
|
72 (7.0)
|
6 (4.8)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
32 (7.8)
|
14 (11.5)
|
All
|
All
|
3 (2.4)
|
2 (7.1)
|
All
|
All
|
20 (4.6)
|
9 (16.7)
|
All
|
All
|
|
Tophi
|
All
|
All
|
All
|
All
|
210 (21.5)
|
42 (20.0)
|
All
|
All
|
476 (46.1)
|
50 (40.3)
|
All
|
All
|
None
|
None
|
2 (0.9)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
59 (14.5)
|
30 (24.6)
|
All
|
All
|
50 (40.3)
|
7 (25.0)
|
All
|
All
|
39 (8.9)
|
5 (9.3)
|
All
|
All
|
|
PRS
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
|
Prophylaxis
|
All
|
All
|
All
|
All
|
918 (93.9)
|
196 (93.3)
|
763 (99.9)
|
All
|
317 (30.7)
|
43 (34.7)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
348 (85.3)
|
108 (88.5)
|
240 (99.6)
|
All
|
None
|
2 (7.1)
|
43 (97.7)
|
All
|
390 (89.4)
|
45 (83.3)
|
All
|
179 (99.4)
|
|
BMI
|
22 (0.4)
|
3 (0.6)
|
454 (0.3)
|
463 (0.3)
|
58 (5.9)
|
18 (8.6)
|
174 (22.8)
|
53 (8.6)
|
44 (4.3)
|
2 (1.6)
|
All
|
All
|
5 (0.6)
|
None
|
1 (0.4)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
7 (1.7)
|
4 (3.3)
|
3 (1.2)
|
13 (3.4)
|
2 (1.6)
|
None
|
4 (9.1)
|
6 (16.2)
|
10 (2.3)
|
3 (5.6)
|
3 (1.5)
|
8 (4.4)
|
|
Hypertension
|
None
|
None
|
None
|
None
|
383 (39.2)
|
36 (17.1)
|
486 (63.6)
|
322 (52.0)
|
365 (35.4)
|
35 (28.2)
|
All
|
All
|
None
|
None
|
1 (0.4)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
120 (29.4)
|
14 (11.5)
|
171 (71.0)
|
252 (66.0)
|
30 (24.2)
|
3 (10.7)
|
32 (72.7)
|
28 (75.7)
|
196 (45.0)
|
13 (24.1)
|
158 (79.4)
|
135 (75.0)
|
|
Type 2 Diabetes
|
636 (10.2)
|
53 (10.2)
|
15,566 (10.8)
|
20,793 (12.8)
|
72 (7.4)
|
17 (8.1)
|
312 (40.8)
|
266 (43.0)
|
181 (17.5)
|
29 (23.4)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
11 (2.7)
|
1 (0.8)
|
9 (3.7)
|
18 (4.7)
|
86 (69.4)
|
18 (64.3)
|
37 (84.1)
|
36 (97.3)
|
15 (3.4)
|
None
|
6 (3.0)
|
8 (4.4)
|
|
Heart Disease
|
None
|
None
|
None
|
None
|
209 (21.4)
|
32 (15.2)
|
379 (49.6)
|
292 (47.2)
|
518 (50.2)
|
52 (41.9)
|
All
|
All
|
None
|
None
|
1 (0.4)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
45 (11.0)
|
11 (9.0)
|
28 (11.6)
|
60 (15.7)
|
46 (37.1)
|
7 (25.0)
|
29 (65.9)
|
24 (64.9)
|
34 (7.8)
|
4 (7.4)
|
46 (23.1)
|
13 (7.2)
|
|
Kidney Disease
|
277 (4.5)
|
25 (4.8)
|
6,530 (4.5)
|
7,657 (4.7)
|
222 (22.7)
|
43 (20.5)
|
383 (50.1)
|
261 (42.2)
|
512 (49.6)
|
44 (35.5)
|
All
|
All
|
5 (0.6)
|
None
|
2 (0.9)
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
79 (19.4)
|
25 (20.5)
|
191 (79.3)
|
308 (80.6)
|
94 (75.8)
|
16 (57.1)
|
38 (86.4)
|
29 (78.4)
|
73 (16.7)
|
9 (16.7)
|
160 (80.4)
|
141 (78.3)
|
|
Dyslipidemia
|
None
|
None
|
None
|
None
|
376 (38.4)
|
71 (33.8)
|
438 (57.3)
|
214 (34.6)
|
255 (24.7)
|
35 (28.2)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
1 (0.6)
|
None
|
None
|
None
|
139 (34.1)
|
39 (32.0)
|
171 (71.0)
|
269 (70.4)
|
53 (42.7)
|
7 (25.0)
|
28 (63.6)
|
29 (78.4)
|
150 (34.4)
|
14 (25.9)
|
147 (73.9)
|
115 (63.9)
|
|
Stroke
|
None
|
None
|
None
|
None
|
298 (30.5)
|
62 (29.5)
|
267 (34.9)
|
90 (14.5)
|
603 (58.4)
|
73 (58.9)
|
All
|
All
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
None
|
63 (15.4)
|
16 (13.1)
|
32 (13.3)
|
49 (12.8)
|
66 (53.2)
|
11 (39.3)
|
31 (70.5)
|
26 (70.3)
|
48 (11.0)
|
5 (9.3)
|
44 (22.1)
|
14 (7.8)
|
|
Alcoholic Drinks / Week
|
793 (12.8)
|
240 (46.0)
|
29,263 (20.4)
|
56,288 (34.7)
|
None
|
None
|
123 (16.1)
|
202 (32.6)
|
509 (49.3)
|
57 (46.0)
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
34 (8.3)
|
3 (2.5)
|
2 (0.8)
|
7 (1.8)
|
None
|
None
|
None
|
None
|
33 (7.6)
|
2 (3.7)
|
3 (1.5)
|
1 (0.6)
|
|
Sugar-Sweetened Drinks / Week
|
All
|
All
|
All
|
All
|
106 (10.8)
|
27 (12.9)
|
172 (22.5)
|
51 (8.2)
|
762 (73.8)
|
94 (75.8)
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
75 (18.4)
|
17 (13.9)
|
6 (2.5)
|
10 (2.6)
|
None
|
None
|
None
|
None
|
87 (20.0)
|
4 (7.4)
|
4 (2.0)
|
1 (0.6)
|
|
Current Smoker
|
None
|
None
|
None
|
None
|
463 (47.3)
|
124 (59.0)
|
262 (34.3)
|
142 (22.9)
|
464 (45.0)
|
47 (37.9)
|
All
|
All
|
All
|
All
|
2 (0.9)
|
None
|
None
|
None
|
None
|
None
|
3 (2.8)
|
None
|
211 (51.7)
|
69 (56.6)
|
100 (41.5)
|
190 (49.7)
|
57 (46.0)
|
9 (32.1)
|
29 (65.9)
|
25 (67.6)
|
312 (71.6)
|
21 (38.9)
|
92 (46.2)
|
69 (38.3)
|
|
Family History of Gout
|
All
|
All
|
All
|
All
|
71 (7.3)
|
21 (10.0)
|
408 (53.4)
|
298 (48.1)
|
244 (23.6)
|
19 (15.3)
|
35 (79.5)
|
71 (89.9)
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
56 (13.7)
|
15 (12.3)
|
42 (17.4)
|
58 (15.2)
|
8 (6.5)
|
2 (7.1)
|
3 (6.8)
|
4 (10.8)
|
54 (12.4)
|
2 (3.7)
|
26 (13.1)
|
23 (12.8)
|
|
No. Relatives w/ Gout
|
All
|
All
|
All
|
All
|
282 (28.8)
|
64 (30.5)
|
531 (69.5)
|
459 (74.2)
|
643 (62.3)
|
84 (67.7)
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
All
|
104 (25.5)
|
30 (24.6)
|
65 (27.0)
|
88 (23.0)
|
36 (29.0)
|
8 (28.6)
|
17 (38.6)
|
10 (27.0)
|
84 (19.3)
|
13 (24.1)
|
45 (22.6)
|
46 (25.6)
|
|
Note:
|
|
‘All’ = all missing, ‘None’ = none missing
|
Producing full cohort statistics supplementary table.
# Producing table of cohort statistics.
table1 <- tibble("Cohort" = cohortstring,
"N" = unlist(lapply(data_list, function(x) format(nrow(x), big.mark = ","))),
"Age at Collection (years)" = unlist(lapply(data_list, function(x) report(x$AGECOL))),
"Serum Urate (mg/dL)" = unlist(lapply(data_list, function(x) report(x$URATE))),
"ULT" = unlist(lapply(data_list, function(x) sumreport(x$ULT))),
"Age at Onset (years)" = unlist(lapply(data_list, function(x) report(x$AGE1ATK))),
"Disease Duration (years)" = unlist(lapply(data_list, function(x) report(x$DURATION))),
"Number of Flares in Last Year" = unlist(lapply(data_list, function(x) report_median(x$NUMATK))),
"Presence of Tophi" = unlist(lapply(data_list, function(x) sumreport(x$TOPHIGOUT))),
"PRS" = unlist(lapply(data_list, function(x) report(x$PRS))),
"Prophylaxis" = unlist(lapply(data_list, function(x) sumreport(x$PROPHY))),
"BMI" = unlist(lapply(data_list, function(x) report(x$BMI))),
"Hypertension" = unlist(lapply(data_list, function(x) sumreport(x$HYPERTENSION))),
"Type 2 Diabetes" = unlist(lapply(data_list, function(x) sumreport(x$DIABETES))),
"Heart Disease" = unlist(lapply(data_list, function(x) sumreport(x$HEART))),
"Kidney Disease" = unlist(lapply(data_list, function(x) sumreport(x$KIDNEY))),
"Dyslipidemia" = unlist(lapply(data_list, function(x) sumreport(x$LIPIDS))),
"Stroke" = unlist(lapply(data_list, function(x) sumreport(x$STROKE))),
"Alcoholic Drinks / Week" = unlist(lapply(data_list, function(x) report(x$TOTALALC))),
"Sugar-Sweetened Drinks / Week" = unlist(lapply(data_list, function(x) report(x$SUGDRINK))),
"Current Smoker" = unlist(lapply(data_list, function(x) sumreport(x$CURSMOKE))),
"Family History of Gout" = unlist(lapply(data_list, function(x) sumreport(x$FAMGOUT))),
"No. Relatives w/ Gout" = unlist(lapply(data_list, function(x) report(x$FAMGOUTNUM))))
# Transposing dataframe.
table1 <- transpose_df(table1) %>%
column_to_rownames(var = "Cohort")
# Writing out table.
write_delim(table1, file = here("Output/Tables/demographics.txt"), delim = "\t")
# Preparing for printing.
table1 <- table1 %>%
mutate(across(.cols = 1:ncol(table1), ~ str_replace(string = .x, pattern = " ", replacement = " ")))
# Preparing for printing.
row.names(table1) <- str_replace(row.names(table1), " ", " ")
# Printing table.
table1 %>%
kable(col.names = clean_names,
align = "c",
escape = F) %>%
kable_styling("striped") %>%
scroll_box(width = "900px", height = "475px") %>%
footnote("Flares reported as median (inter-quartile range). All other numeric variables reported as mean ± sd. All binary variables reported as N (%).")
|
|
UK Biobank Gout Male
|
UK Biobank Gout Female
|
UK Biobank Control Male
|
UK Biobank Control Female
|
Aus/NZ European Gout Male
|
Aus/NZ European Gout Female
|
Aus/NZ European Control Male
|
Aus/NZ European Control Female
|
GlobalGout Gout Male
|
GlobalGout Gout Female
|
GlobalGout Control Male
|
GlobalGout Control Female
|
Ardea LASSO Gout Male
|
Ardea LASSO Gout Female
|
Ardea CLEAR1 Gout Male
|
Ardea CLEAR1 Gout Female
|
Ardea CLEAR2 Gout Male
|
Ardea CLEAR2 Gout Female
|
Ardea CRYSTAL Gout Male
|
Ardea CRYSTAL Gout Female
|
Ardea LIGHT Gout Male
|
Ardea LIGHT Gout Female
|
East Polynesian Gout Male
|
East Polynesian Gout Female
|
East Polynesian Control Male
|
East Polynesian Control Female
|
East Polynesian Gout Male NP
|
East Polynesian Gout Female NP
|
East Polynesian Control Male NP
|
East Polynesian Control Female NP
|
West Polynesian Gout Male
|
West Polynesian Gout Female
|
West Polynesian Control Male
|
West Polynesian Control Female
|
|
N
|
6,210
|
522
|
143,574
|
162,244
|
978
|
210
|
764
|
619
|
1,032
|
124
|
44
|
79
|
819
|
65
|
230
|
16
|
239
|
7
|
175
|
4
|
109
|
8
|
408
|
122
|
241
|
382
|
124
|
28
|
44
|
37
|
436
|
54
|
199
|
180
|
|
Age at Collection (years)
|
59.7 ± 7.0
|
61.8 ± 6.0
|
57.2 ± 8.1
|
57.0 ± 7.8
|
62.4 ± 12.4
|
70.0 ± 12.7
|
54.9 ± 17.1
|
51.3 ± 17.3
|
60.1 ± 13.1
|
67.6 ± 11.1
|
60.0 ± 14.8
|
64.2 ± 11.5
|
51.4 ± 11.8
|
60.7 ± 10.6
|
52.3 ± 11.1
|
61.4 ± 7.4
|
53.0 ± 10.8
|
55.0 ± 16.0
|
53.9 ± 11.0
|
63.8 ± 5.4
|
53.3 ± 11.8
|
64.0 ± 16.1
|
54.3 ± 12.4
|
60.7 ± 11.7
|
43.9 ± 15.6
|
45.6 ± 14.8
|
59.7 ± 11.3
|
59.1 ± 13.3
|
49.8 ± 13.5
|
48.7 ± 17.1
|
47.5 ± 12.3
|
53.4 ± 13.4
|
39.3 ± 15.0
|
40.5 ± 15.4
|
|
Serum Urate (mg/dL)
|
6.4 ± 1.7
|
5.5 ± 1.7
|
5.9 ± 1.2
|
4.5 ± 1.1
|
6.7 ± 1.9
|
6.4 ± 2.3
|
6.4 ± 1.7
|
4.6 ± 1.6
|
7.4 ± 2.3
|
7.7 ± 2.6
|
6.6 ± 1.7
|
6.6 ± 1.7
|
8.9 ± 1.3
|
8.9 ± 1.4
|
7.9 ± 1.4
|
8.1 ± 1.2
|
7.9 ± 1.5
|
8.4 ± 2.0
|
8.8 ± 1.5
|
10.1 ± 1.2
|
9.3 ± 1.7
|
8.1 ± 1.4
|
7.0 ± 2.3
|
6.3 ± 2.5
|
6.5 ± 1.9
|
5.4 ± 1.5
|
7.0 ± 1.7
|
6.9 ± 2.4
|
6.3 ± 1.3
|
5.3 ± 1.3
|
7.7 ± 2.1
|
7.0 ± 2.7
|
6.7 ± 1.6
|
5.4 ± 1.7
|
|
ULT
|
4499 (72.4)
|
372 (71.3)
|
NA
|
NA
|
564 (96.6)
|
108 (95.6)
|
NA
|
NA
|
570 (75.7)
|
49 (59.8)
|
NA
|
NA
|
255 (31.2)
|
26 (40.6)
|
230 (100.0)
|
16 (100.0)
|
239 (100.0)
|
7 (100.0)
|
94 (100.0)
|
1 (100.0)
|
109 (100.0)
|
8 (100.0)
|
262 (92.3)
|
86 (95.6)
|
NA
|
NA
|
96 (93.2)
|
19 (73.1)
|
NA
|
NA
|
292 (94.2)
|
38 (95.0)
|
NA
|
NA
|
|
Age at Onset (years)
|
NA
|
NA
|
NA
|
NA
|
46.4 ± 15.8
|
59.5 ± 15.7
|
NA
|
NA
|
46.5 ± 14.0
|
57.8 ± 12.5
|
NA
|
NA
|
41.4 ± 13.4
|
55.1 ± 12.0
|
41.9 ± 12.4
|
55.2 ± 11.2
|
42.6 ± 13.2
|
46.1 ± 20.6
|
40.1 ± 13.0
|
61.5 ± 6.2
|
42.4 ± 13.1
|
55.2 ± 17.3
|
37.9 ± 14.0
|
49.4 ± 15.4
|
NA
|
NA
|
39.1 ± 15.2
|
46.0 ± 16.8
|
NA
|
NA
|
34.6 ± 12.0
|
44.3 ± 15.0
|
NA
|
NA
|
|
Disease Duration (years)
|
NA
|
NA
|
NA
|
NA
|
16.8 ± 12.7
|
10.9 ± 10.4
|
NA
|
NA
|
14.5 ± 11.4
|
10.6 ± 9.8
|
NA
|
NA
|
11.0 ± 9.4
|
6.6 ± 8.0
|
11.4 ± 9.4
|
7.1 ± 9.5
|
11.4 ± 9.8
|
9.9 ± 11.6
|
14.8 ± 10.0
|
3.2 ± 1.0
|
11.9 ± 8.7
|
9.8 ± 11.0
|
17.2 ± 12.8
|
13.1 ± 13.2
|
NA
|
NA
|
21.7 ± 15.3
|
14.1 ± 12.6
|
NA
|
NA
|
13.6 ± 10.3
|
9.2 ± 9.2
|
NA
|
NA
|
|
Number of Flares in Last Year
|
NA
|
NA
|
NA
|
NA
|
2 (0 - 4)
|
1.5 (0 - 3.25)
|
NA
|
NA
|
2 (1 - 4)
|
2.5 (1 - 4)
|
NA
|
NA
|
4 (3 - 8)
|
3 (3 - 6)
|
3 (2 - 6)
|
3 (3 - 4)
|
4 (2 - 8)
|
5 (3 - 6)
|
4 (3 - 6)
|
4.5 (2.25 - 6)
|
4 (2 - 10)
|
4 (2.75 - 5.25)
|
3 (1 - 6)
|
2 (0 - 5)
|
NA
|
NA
|
2 (0 - 3)
|
3 (1 - 6)
|
NA
|
NA
|
4 (2 - 10)
|
2 (1 - 5)
|
NA
|
NA
|
|
Presence of Tophi
|
NA
|
NA
|
NA
|
NA
|
333 (43.4)
|
67 (39.9)
|
NA
|
NA
|
320 (57.6)
|
46 (62.2)
|
NA
|
NA
|
138 (16.8)
|
5 (7.7)
|
34 (14.9)
|
1 (6.2)
|
54 (22.6)
|
2 (28.6)
|
174 (99.4)
|
4 (100.0)
|
26 (23.9)
|
5 (62.5)
|
144 (41.3)
|
26 (28.3)
|
NA
|
NA
|
9 (12.2)
|
4 (19.0)
|
NA
|
NA
|
177 (44.6)
|
14 (28.6)
|
NA
|
NA
|
|
PRS
|
4.1 ± 0.6
|
4.0 ± 0.7
|
3.7 ± 0.6
|
3.7 ± 0.6
|
4.1 ± 0.7
|
4.0 ± 0.6
|
3.7 ± 0.6
|
3.7 ± 0.6
|
4.0 ± 0.6
|
4.0 ± 0.6
|
3.8 ± 0.6
|
3.8 ± 0.6
|
4.1 ± 0.7
|
4.1 ± 0.6
|
4.2 ± 0.7
|
4.3 ± 0.6
|
4.2 ± 0.6
|
4.3 ± 0.8
|
4.2 ± 0.6
|
4.0 ± 0.4
|
4.1 ± 0.6
|
4.2 ± 0.2
|
4.4 ± 0.5
|
4.4 ± 0.5
|
4.2 ± 0.4
|
4.2 ± 0.5
|
4.2 ± 0.5
|
4.4 ± 0.5
|
4.2 ± 0.5
|
4.1 ± 0.6
|
4.8 ± 0.6
|
4.7 ± 0.6
|
4.3 ± 0.6
|
4.3 ± 0.6
|
|
Prophylaxis
|
NA
|
NA
|
NA
|
NA
|
56 (93.3)
|
13 (92.9)
|
0 (0.0)
|
NA
|
443 (62.0)
|
55 (67.9)
|
NA
|
NA
|
810 (98.9)
|
65 (100.0)
|
230 (100.0)
|
16 (100.0)
|
239 (100.0)
|
7 (100.0)
|
175 (100.0)
|
4 (100.0)
|
109 (100.0)
|
8 (100.0)
|
59 (98.3)
|
13 (92.9)
|
1 (100.0)
|
NA
|
112 (90.3)
|
23 (88.5)
|
1 (100.0)
|
NA
|
46 (100.0)
|
7 (77.8)
|
NA
|
0 (0.0)
|
|
BMI
|
30.5 ± 4.8
|
32.1 ± 6.4
|
27.6 ± 4.1
|
26.8 ± 5.0
|
30.3 ± 5.2
|
30.9 ± 7.3
|
27.2 ± 4.7
|
27.0 ± 6.2
|
29.4 ± 4.7
|
30.9 ± 6.7
|
NA
|
NA
|
34.1 ± 6.7
|
38.0 ± 10.3
|
34.6 ± 6.1
|
38.1 ± 6.5
|
33.7 ± 6.0
|
36.2 ± 7.5
|
32.2 ± 5.4
|
36.5 ± 3.8
|
31.3 ± 4.9
|
35.7 ± 8.7
|
35.4 ± 8.0
|
38.2 ± 9.8
|
31.9 ± 7.1
|
32.7 ± 8.5
|
35.9 ± 7.7
|
39.5 ± 7.6
|
32.5 ± 7.8
|
29.1 ± 6.1
|
36.1 ± 6.7
|
38.5 ± 9.1
|
33.1 ± 6.2
|
34.3 ± 7.7
|
|
Hypertension
|
4242 (68.3)
|
413 (79.1)
|
56792 (39.6)
|
48708 (30.0)
|
573 (96.3)
|
172 (98.9)
|
169 (60.8)
|
121 (40.7)
|
662 (99.3)
|
89 (100.0)
|
NA
|
NA
|
401 (49.0)
|
48 (73.8)
|
146 (63.8)
|
15 (93.8)
|
166 (69.5)
|
6 (85.7)
|
101 (57.7)
|
4 (100.0)
|
57 (52.3)
|
8 (100.0)
|
267 (92.7)
|
107 (99.1)
|
69 (98.6)
|
119 (91.5)
|
94 (100.0)
|
25 (100.0)
|
12 (100.0)
|
9 (100.0)
|
212 (88.3)
|
39 (95.1)
|
36 (87.8)
|
43 (95.6)
|
|
Type 2 Diabetes
|
1090 (19.6)
|
119 (25.4)
|
11226 (8.8)
|
6729 (4.8)
|
144 (15.9)
|
51 (26.4)
|
55 (12.2)
|
48 (13.6)
|
350 (41.1)
|
53 (55.8)
|
NA
|
NA
|
79 (9.6)
|
16 (24.6)
|
30 (13.0)
|
6 (37.5)
|
32 (13.4)
|
0 (0.0)
|
24 (13.7)
|
2 (50.0)
|
12 (11.0)
|
0 (0.0)
|
121 (30.5)
|
61 (50.4)
|
53 (22.8)
|
77 (21.2)
|
38 (100.0)
|
10 (100.0)
|
7 (100.0)
|
1 (100.0)
|
80 (19.0)
|
26 (48.1)
|
33 (17.1)
|
44 (25.6)
|
|
Heart Disease
|
1441 (23.2)
|
137 (26.2)
|
19949 (13.9)
|
9997 (6.2)
|
320 (41.6)
|
92 (51.7)
|
84 (21.8)
|
37 (11.3)
|
159 (30.9)
|
39 (54.2)
|
NA
|
NA
|
40 (4.9)
|
3 (4.6)
|
16 (7.0)
|
0 (0.0)
|
25 (10.5)
|
0 (0.0)
|
17 (9.7)
|
0 (0.0)
|
5 (4.6)
|
0 (0.0)
|
139 (38.3)
|
65 (58.6)
|
45 (21.1)
|
47 (14.6)
|
40 (51.3)
|
13 (61.9)
|
4 (26.7)
|
2 (15.4)
|
77 (19.2)
|
17 (34.0)
|
12 (7.8)
|
18 (10.8)
|
|
Kidney Disease
|
731 (12.3)
|
134 (27.0)
|
4755 (3.5)
|
7077 (4.6)
|
359 (47.5)
|
122 (73.1)
|
214 (56.2)
|
251 (70.1)
|
238 (45.8)
|
64 (80.0)
|
NA
|
NA
|
138 (17.0)
|
28 (43.1)
|
28 (12.3)
|
9 (56.2)
|
38 (15.9)
|
4 (57.1)
|
31 (17.7)
|
2 (50.0)
|
13 (11.9)
|
4 (50.0)
|
153 (46.5)
|
71 (73.2)
|
34 (68.0)
|
57 (77.0)
|
28 (93.3)
|
12 (100.0)
|
4 (66.7)
|
8 (100.0)
|
130 (35.8)
|
32 (71.1)
|
21 (53.8)
|
30 (76.9)
|
|
Dyslipidemia
|
3102 (50.0)
|
276 (52.9)
|
41243 (28.7)
|
27809 (17.1)
|
494 (82.1)
|
117 (84.2)
|
169 (51.8)
|
143 (35.3)
|
563 (72.5)
|
71 (79.8)
|
NA
|
NA
|
331 (40.4)
|
37 (56.9)
|
110 (47.8)
|
13 (81.2)
|
98 (41.0)
|
4 (57.1)
|
78 (44.8)
|
3 (75.0)
|
41 (37.6)
|
5 (62.5)
|
235 (87.4)
|
77 (92.8)
|
56 (80.0)
|
90 (79.6)
|
71 (100.0)
|
21 (100.0)
|
16 (100.0)
|
8 (100.0)
|
238 (83.2)
|
36 (90.0)
|
38 (73.1)
|
37 (56.9)
|
|
Stroke
|
410 (6.6)
|
56 (10.7)
|
5885 (4.1)
|
4062 (2.5)
|
48 (7.1)
|
16 (10.8)
|
138 (27.8)
|
214 (40.5)
|
41 (9.6)
|
8 (15.7)
|
NA
|
NA
|
7 (0.9)
|
1 (1.5)
|
3 (1.3)
|
0 (0.0)
|
1 (0.4)
|
0 (0.0)
|
3 (1.7)
|
0 (0.0)
|
0 (0.0)
|
0 (0.0)
|
24 (7.0)
|
13 (12.3)
|
12 (5.7)
|
19 (5.7)
|
2 (3.4)
|
1 (5.9)
|
0 (0.0)
|
0 (0.0)
|
12 (3.1)
|
5 (10.2)
|
4 (2.6)
|
3 (1.8)
|
|
Alcoholic Drinks / Week
|
7.6 ± 5.4
|
6.0 ± 5.1
|
6.6 ± 5.2
|
5.8 ± 4.9
|
7.8 ± 10.5
|
2.4 ± 5.1
|
4.8 ± 9.6
|
2.7 ± 4.2
|
14.1 ± 19.2
|
4.4 ± 7.5
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
5.5 ± 14.3
|
1.9 ± 7.2
|
5.3 ± 10.9
|
2.4 ± 5.8
|
5.9 ± 8.6
|
2.2 ± 4.4
|
3.2 ± 6.0
|
3.8 ± 7.6
|
4.2 ± 9.0
|
0.9 ± 2.7
|
4.3 ± 11.0
|
1.2 ± 3.7
|
|
Sugar-Sweetened Drinks / Week
|
NA
|
NA
|
NA
|
NA
|
1.0 ± 1.5
|
0.6 ± 1.1
|
0.9 ± 1.3
|
0.5 ± 1.1
|
0.8 ± 1.3
|
0.7 ± 1.2
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
1.7 ± 1.9
|
1.0 ± 1.5
|
1.8 ± 2.5
|
1.2 ± 1.6
|
1.3 ± 1.7
|
0.8 ± 1.3
|
2.1 ± 2.2
|
1.4 ± 2.0
|
2.3 ± 2.2
|
1.4 ± 1.5
|
2.0 ± 1.7
|
1.4 ± 1.5
|
|
Current Smoker
|
349 (5.6)
|
38 (7.3)
|
12298 (8.6)
|
10641 (6.6)
|
22 (4.3)
|
5 (5.8)
|
26 (5.2)
|
20 (4.2)
|
94 (16.5)
|
14 (18.2)
|
NA
|
NA
|
NA
|
NA
|
40 (17.5)
|
0 (0.0)
|
30 (12.6)
|
0 (0.0)
|
35 (20.0)
|
0 (0.0)
|
19 (17.9)
|
0 (0.0)
|
35 (17.8)
|
4 (7.5)
|
41 (29.1)
|
45 (23.4)
|
16 (23.9)
|
3 (15.8)
|
4 (26.7)
|
4 (33.3)
|
12 (9.7)
|
2 (6.1)
|
23 (21.5)
|
15 (13.5)
|
|
Family History of Gout
|
NA
|
NA
|
NA
|
NA
|
402 (44.3)
|
89 (47.1)
|
58 (16.3)
|
76 (23.7)
|
270 (34.3)
|
40 (38.1)
|
3 (33.3)
|
2 (25.0)
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
239 (67.9)
|
80 (74.8)
|
87 (43.7)
|
148 (45.7)
|
79 (68.1)
|
22 (84.6)
|
15 (36.6)
|
19 (57.6)
|
237 (62.0)
|
32 (61.5)
|
66 (38.2)
|
61 (38.9)
|
|
No. Relatives w/ Gout
|
NA
|
NA
|
NA
|
NA
|
0.8 ± 1.0
|
1.0 ± 1.3
|
0.3 ± 0.6
|
0.5 ± 0.7
|
0.7 ± 0.8
|
0.9 ± 0.8
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
1.7 ± 2.2
|
1.8 ± 1.5
|
0.8 ± 1.3
|
0.7 ± 1.0
|
2.0 ± 1.9
|
2.2 ± 1.5
|
0.8 ± 1.0
|
1.3 ± 1.4
|
1.5 ± 1.9
|
1.5 ± 1.9
|
0.6 ± 0.9
|
0.6 ± 0.9
|
|
Note:
|
|
Flares reported as median (inter-quartile range). All other numeric variables reported as mean ± sd. All binary variables reported as N (%).
|
Producing supplementary table of model results.
# Preparing gout model results table (excluding East Polynesians for now).
NotEast <- GoutModels %>%
mutate(Sex = case_when(str_detect(Cohort, "Male") ~ "Male",
str_detect(Cohort, "Female") ~ "Female"),
Cohort = str_remove(Cohort, " - Male| - Female"),
LCL = sprintf(LCL, fmt = "%#.2f"),
UCL = sprintf(UCL, fmt = "%#.2f"),
OR = sprintf(OR, fmt = "%#.2f"),
CI = paste0("[", LCL, ", ", UCL, "]"),
Adjusted = case_when(str_detect(Covariates, "AGECOL") ~ TRUE,
TRUE ~ FALSE)) %>%
filter(!str_detect(Cohort, "East Polynesian")) %>%
select(Cohort, `N case`, `N control`, Sex, Adjusted, Predictor, OR, CI, Pval)
# Producing East Polynesian gout results table.
tmp <- GoutModels %>%
mutate(Sex = case_when(str_detect(Cohort, "Male") ~ "Male",
str_detect(Cohort, "Female") ~ "Female"),
Cohort = str_remove(Cohort, " - Male| - Female"),
Adjusted = case_when(str_detect(Covariates, "AGECOL") ~ TRUE,
TRUE ~ FALSE),
`log-odds` = case_when(`N case` < 20 | `N control` < 20 ~ NA_real_,
TRUE ~ `log-odds`)) %>%
filter(str_detect(Cohort, "East Polynesian"),
!is.na(`log-odds`)) %>%
select(Cohort, `N case`, `N control`, Sex, Predictor, Adjusted, `log-odds`, SE)
# Keeping unadjusted models.
tmp2 <- tmp %>%
filter(!Adjusted)
# Making list of SNPs.
snplist <- as.character(unique(GoutModels$Predictor))
# Preparing tibble with named columns of specified types (using placeholder row).
out <- tibble("Predictor" = "test",
"Male OR" = 1,
"Male LCL" = 1,
"Male UCL" = 1,
"Male Pval" = 1,
"Female OR" = 1,
"Female LCL" = 1,
"Female UCL" = 1,
"Female Pval" = 1)
# For each predictor in snplist.
for(i in seq_along(snplist)){
# Keeping only models with that predictor.
tmp3 <- tmp2 %>%
filter(Predictor == snplist[i])
# Running meta-analysis.
gout <- metagen(TE = `log-odds`,
seTE = SE,
studlab = Cohort,
subgroup = Sex,
data = tmp3,
sm = "OR")
# Adding meta-analysis results to out table.
out <- rbind(out,
list(snplist[i],
exp(gout$TE.fixed.w[[1]]),
exp(gout$lower.fixed.w[[1]]),
exp(gout$upper.fixed.w[[1]]),
gout$pval.fixed.w[[1]],
exp(gout$TE.fixed.w[[2]]),
exp(gout$lower.fixed.w[[2]]),
exp(gout$upper.fixed.w[[2]]),
gout$pval.fixed.w[[2]]))
}
# Cleaning up results table.
out1 <- out %>%
slice(-1) %>%
mutate(Cohort = "East Polynesian",
Adjusted = FALSE,
`Male LCL` = sprintf(`Male LCL`, fmt = "%#.2f"),
`Male UCL` = sprintf(`Male UCL`, fmt = "%#.2f"),
`Male OR` = sprintf(`Male OR`, fmt = "%#.2f"),
`Male CI` = paste0("[", `Male LCL`, ", ", `Male UCL`, "]"),
`Female LCL` = sprintf(`Female LCL`, fmt = "%#.2f"),
`Female UCL` = sprintf(`Female UCL`, fmt = "%#.2f"),
`Female OR` = sprintf(`Female OR`, fmt = "%#.2f"),
`Female CI` = paste0("[", `Female LCL`, ", ", `Female UCL`, "]")) %>%
select(Cohort, Adjusted, Predictor, `Male OR`, `Male CI`, `Male Pval`, `Female OR`, `Female CI`, `Female Pval`)
# Extracting adjusted models.
tmp2 <- tmp %>%
filter(Adjusted)
# Making table for outputting meta-analysis results.
out <- tibble("Predictor" = "test",
"Male OR" = 1,
"Male LCL" = 1,
"Male UCL" = 1,
"Male Pval" = 1,
"Female OR" = 1,
"Female LCL" = 1,
"Female UCL" = 1,
"Female Pval" = 1)
# For each predictor.
for(i in seq_along(snplist)){
# Extracting models with that predictor.
tmp3 <- tmp2 %>%
filter(Predictor == snplist[i])
# Running meta-analysis.
gout <- metagen(TE = `log-odds`,
seTE = SE,
studlab = Cohort,
subgroup = Sex,
data = tmp3,
sm = "OR")
# Summarizing meta-analysis results.
out <- rbind(out,
list(snplist[i],
exp(gout$TE.fixed.w[[1]]),
exp(gout$lower.fixed.w[[1]]),
exp(gout$upper.fixed.w[[1]]),
gout$pval.fixed.w[[1]],
exp(gout$TE.fixed.w[[2]]),
exp(gout$lower.fixed.w[[2]]),
exp(gout$upper.fixed.w[[2]]),
gout$pval.fixed.w[[2]]))
}
# Cleaning up results table.
out2 <- out %>%
slice(-1) %>%
mutate(Cohort = "East Polynesian",
Adjusted = TRUE,
`Male LCL` = sprintf(`Male LCL`, fmt = "%#.2f"),
`Male UCL` = sprintf(`Male UCL`, fmt = "%#.2f"),
`Male OR` = sprintf(`Male OR`, fmt = "%#.2f"),
`Male CI` = paste0("[", `Male LCL`, ", ", `Male UCL`, "]"),
`Female LCL` = sprintf(`Female LCL`, fmt = "%#.2f"),
`Female UCL` = sprintf(`Female UCL`, fmt = "%#.2f"),
`Female OR` = sprintf(`Female OR`, fmt = "%#.2f"),
`Female CI` = paste0("[", `Female LCL`, ", ", `Female UCL`, "]")) %>%
select(Cohort, Adjusted, Predictor, `Male OR`, `Male CI`, `Male Pval`, `Female OR`, `Female CI`, `Female Pval`)
# Recombining meta-analysis results.
EastPoly <- rbind(out1, out2)
# Combining all gout model results.
final <- rbind(NotEast, EastPoly)
# Writing out gout model results supplementary table.
write_delim(final, file = here("Output/Tables/GoutMod.txt"), delim = "\t")
Doing the same for age at onset models.
# Preparing West Polynesian onset models table.
tmp <- OnsetModels %>%
mutate(Sex = case_when(str_detect(Cohort, "Male") ~ "Male",
str_detect(Cohort, "Female") ~ "Female"),
Cohort = str_remove(Cohort, " - Male| - Female"),
LCL = sprintf(LCL, fmt = "%#.2f"),
UCL = sprintf(UCL, fmt = "%#.2f"),
Beta = sprintf(Beta, fmt = "%#.2f"),
CI = paste0("[", LCL, ", ", UCL, "]")) %>%
filter(str_detect(Cohort, "West Polynesian")) %>%
select(Cohort, Sex, Predictor, Beta, CI, Pval)
# Extracting male model results.
tmp1 <- tmp %>%
filter(Sex == "Male") %>%
rename("Male Beta" = Beta,
"Male CI" = CI,
"Male Pval" = Pval) %>%
select(-Sex)
# Extracting female model results.
tmp2 <- tmp %>%
filter(Sex == "Female") %>%
rename("Female Beta" = Beta,
"Female CI" = CI,
"Female Pval" = Pval) %>%
select(-Sex)
# Joining results together.
WestPoly <- left_join(tmp1, tmp2)
# Preparing results table for meta-analyses.
tmp <- OnsetModels %>%
mutate(Sex = case_when(str_detect(Cohort, "Male") ~ "Male",
str_detect(Cohort, "Female") ~ "Female"),
Cohort = str_remove(Cohort, " - Male| - Female")) %>%
select(Cohort, Sex, Predictor, Beta, SE)
# Extracting only European results.
tmp2 <- tmp %>%
filter(!str_detect(Cohort, "Polynesian"))
# Preparing output table.
out <- tibble("Predictor" = "test",
"Male Beta" = 1,
"Male LCL" = 1,
"Male UCL" = 1,
"Male Pval" = 1,
"Female Beta" = 1,
"Female LCL" = 1,
"Female UCL" = 1,
"Female Pval" = 1)
# For each predictor.
for(i in seq_along(snplist)){
# Extracting models with that predictor.
tmp3 <- tmp2 %>%
filter(Predictor == snplist[i])
# Running meta-analysis.
onset <- metagen(TE = Beta,
seTE = SE,
studlab = Cohort,
byvar = Sex,
data = tmp3)
# Extracting summary of meta-analysis.
out <- rbind(out,
list(snplist[i],
onset$TE.fixed.w[[1]],
onset$lower.fixed.w[[1]],
onset$upper.fixed.w[[1]],
onset$pval.fixed.w[[1]],
onset$TE.fixed.w[[2]],
onset$lower.fixed.w[[2]],
onset$upper.fixed.w[[2]],
onset$pval.fixed.w[[2]]))
}
# Cleaning up European age at onset meta-analysis model results table.
Euro <- out %>%
slice(-1) %>%
mutate(Cohort = "European",
`Male LCL` = sprintf(`Male LCL`, fmt = "%#.2f"),
`Male UCL` = sprintf(`Male UCL`, fmt = "%#.2f"),
`Male Beta` = sprintf(`Male Beta`, fmt = "%#.2f"),
`Male CI` = paste0("[", `Male LCL`, ", ", `Male UCL`, "]"),
`Female LCL` = sprintf(`Female LCL`, fmt = "%#.2f"),
`Female UCL` = sprintf(`Female UCL`, fmt = "%#.2f"),
`Female Beta` = sprintf(`Female Beta`, fmt = "%#.2f"),
`Female CI` = paste0("[", `Female LCL`, ", ", `Female UCL`, "]")) %>%
select(Cohort, Predictor, `Male Beta`, `Male CI`, `Male Pval`, `Female Beta`, `Female CI`, `Female Pval`)
# Extracting East Polynesian model results.
tmp2 <- tmp %>%
filter(str_detect(Cohort, "East Polynesian"))
# Preparing output table.
out <- tibble("Predictor" = "test",
"Male Beta" = 1,
"Male LCL" = 1,
"Male UCL" = 1,
"Male Pval" = 1,
"Female Beta" = 1,
"Female LCL" = 1,
"Female UCL" = 1,
"Female Pval" = 1)
# For each predictor.
for(i in seq_along(snplist)){
# Extracting models with that predictor.
tmp3 <- tmp2 %>%
filter(Predictor == snplist[i])
# Running meta-analysis.
onset <- metagen(TE = Beta,
seTE = SE,
studlab = Cohort,
byvar = Sex,
data = tmp3)
# Summarizing meta-analysis.
out <- rbind(out,
list(snplist[i],
onset$TE.fixed.w[[1]],
onset$lower.fixed.w[[1]],
onset$upper.fixed.w[[1]],
onset$pval.fixed.w[[1]],
onset$TE.fixed.w[[2]],
onset$lower.fixed.w[[2]],
onset$upper.fixed.w[[2]],
onset$pval.fixed.w[[2]]))
}
# Producing clean model output table for East Polynesian age at onset results.
EastPoly <- out %>%
slice(-1) %>%
mutate(Cohort = "East Polynesian",
`Male LCL` = sprintf(`Male LCL`, fmt = "%#.2f"),
`Male UCL` = sprintf(`Male UCL`, fmt = "%#.2f"),
`Male Beta` = sprintf(`Male Beta`, fmt = "%#.2f"),
`Male CI` = paste0("[", `Male LCL`, ", ", `Male UCL`, "]"),
`Female LCL` = sprintf(`Female LCL`, fmt = "%#.2f"),
`Female UCL` = sprintf(`Female UCL`, fmt = "%#.2f"),
`Female Beta` = sprintf(`Female Beta`, fmt = "%#.2f"),
`Female CI` = paste0("[", `Female LCL`, ", ", `Female UCL`, "]")) %>%
select(Cohort, Predictor, `Male Beta`, `Male CI`, `Male Pval`, `Female Beta`, `Female CI`, `Female Pval`)
# Combining results of all three ancestral groups.
final <- rbind(Euro, EastPoly, WestPoly)
# Writing out results.
write_delim(final, file = here("Output/Tables/OnsetMod.txt"), delim = "\t")
And finally for tophaceous gout models.
# Preparing table of West Polynesian tophaceous disease models.
tmp <- TophiModels %>%
mutate(Sex = case_when(str_detect(Cohort, "Male") ~ "Male",
str_detect(Cohort, "Female") ~ "Female"),
Cohort = str_remove(Cohort, " - Male| - Female"),
LCL = sprintf(LCL, fmt = "%#.2f"),
UCL = sprintf(UCL, fmt = "%#.2f"),
OR = sprintf(OR, fmt = "%#.2f"),
CI = paste0("[", LCL, ", ", UCL, "]"),
Adjusted = case_when(str_detect(Covariates, "DURATION") ~ TRUE,
TRUE ~ FALSE),
OR = case_when(`N case` < 20 | `N control` < 20 ~ NA_character_,
TRUE ~ OR),
CI = case_when(`N case` < 20 | `N control` < 20 ~ NA_character_,
TRUE ~ CI),
Pval = case_when(`N case` < 20 | `N control` < 20 ~ NA_real_,
TRUE ~ Pval)) %>%
select(Cohort, Sex, Adjusted, Predictor, OR, CI, Pval) %>%
filter(str_detect(Cohort, "West Polynesian"))
# Extracting male results.
tmp1 <- tmp %>%
filter(Sex == "Male") %>%
rename("Male OR" = OR,
"Male CI" = CI,
"Male Pval" = Pval) %>%
select(Cohort, Adjusted, Predictor, `Male OR`:`Male Pval`)
# Extracting female results.
tmp2 <- tmp %>%
filter(Sex == "Female") %>%
rename("Female OR" = OR,
"Female CI" = CI,
"Female Pval" = Pval) %>%
select(Cohort, Adjusted, Predictor, `Female OR`:`Female Pval`)
# Combining together.
WestPoly <- left_join(tmp1, tmp2)
# Preparing for European results summary.
tmp <- TophiModels %>%
mutate(Sex = case_when(str_detect(Cohort, "Male") ~ "Male",
str_detect(Cohort, "Female") ~ "Female"),
Cohort = str_remove(Cohort, " - Male| - Female"),
Adjusted = str_detect(Covariates, "DURATION"),
`log-odds` = case_when(`N case` < 20 | `N control` < 20 ~ NA_real_,
TRUE ~ `log-odds`)) %>%
filter(!str_detect(Cohort, "Polynesian"),
!is.na(`log-odds`)) %>%
select(Cohort, Sex, Predictor, Adjusted, `log-odds`, SE)
# Extracting unadjusted models only.
tmp2 <- tmp %>%
filter(!Adjusted)
# Preparing output table.
out <- tibble("Predictor" = "test",
"Male OR" = 1,
"Male LCL" = 1,
"Male UCL" = 1,
"Male Pval" = 1,
"Female OR" = 1,
"Female LCL" = 1,
"Female UCL" = 1,
"Female Pval" = 1)
# For each predictor.
for(i in seq_along(snplist)){
# Extracting models with that predictor.
tmp3 <- tmp2 %>%
filter(Predictor == snplist[i])
# Running meta-analysis.
tophi <- metagen(TE = `log-odds`,
seTE = SE,
studlab = Cohort,
byvar = Sex,
data = tmp3,
sm = "OR")
# Summarizing meta-analysis.
out <- rbind(out,
list(snplist[i],
exp(tophi$TE.fixed.w[[1]]),
exp(tophi$lower.fixed.w[[1]]),
exp(tophi$upper.fixed.w[[1]]),
tophi$pval.fixed.w[[1]],
exp(tophi$TE.fixed.w[[2]]),
exp(tophi$lower.fixed.w[[2]]),
exp(tophi$upper.fixed.w[[2]]),
tophi$pval.fixed.w[[2]]))
}
# Cleaning up results.
out1 <- out %>%
slice(-1) %>%
mutate(Cohort = "European",
Adjusted = FALSE,
`Male LCL` = sprintf(`Male LCL`, fmt = "%#.2f"),
`Male UCL` = sprintf(`Male UCL`, fmt = "%#.2f"),
`Male OR` = sprintf(`Male OR`, fmt = "%#.2f"),
`Male CI` = paste0("[", `Male LCL`, ", ", `Male UCL`, "]"),
`Female LCL` = sprintf(`Female LCL`, fmt = "%#.2f"),
`Female UCL` = sprintf(`Female UCL`, fmt = "%#.2f"),
`Female OR` = sprintf(`Female OR`, fmt = "%#.2f"),
`Female CI` = paste0("[", `Female LCL`, ", ", `Female UCL`, "]")) %>%
select(Cohort, Adjusted, Predictor, `Male OR`, `Male CI`, `Male Pval`, `Female OR`, `Female CI`, `Female Pval`)
# Extracting adjusted models.
tmp2 <- tmp %>%
filter(Adjusted)
# Preparing output table.
out <- tibble("Predictor" = "test",
"Male OR" = 1,
"Male LCL" = 1,
"Male UCL" = 1,
"Male Pval" = 1,
"Female OR" = 1,
"Female LCL" = 1,
"Female UCL" = 1,
"Female Pval" = 1)
# For each predictor.
for(i in seq_along(snplist)){
# Extracting models with that predictor.
tmp3 <- tmp2 %>%
filter(Predictor == snplist[i])
# Running meta-analysis.
tophi <- metagen(TE = `log-odds`,
seTE = SE,
studlab = Cohort,
byvar = Sex,
data = tmp3,
sm = "OR")
# Summarizing results.
out <- rbind(out,
list(snplist[i],
exp(tophi$TE.fixed.w[[1]]),
exp(tophi$lower.fixed.w[[1]]),
exp(tophi$upper.fixed.w[[1]]),
tophi$pval.fixed.w[[1]],
exp(tophi$TE.fixed.w[[2]]),
exp(tophi$lower.fixed.w[[2]]),
exp(tophi$upper.fixed.w[[2]]),
tophi$pval.fixed.w[[2]]))
}
# Cleaning up results.
out2 <- out %>%
slice(-1) %>%
mutate(Cohort = "European",
Adjusted = TRUE,
`Male LCL` = sprintf(`Male LCL`, fmt = "%#.2f"),
`Male UCL` = sprintf(`Male UCL`, fmt = "%#.2f"),
`Male OR` = sprintf(`Male OR`, fmt = "%#.2f"),
`Male CI` = paste0("[", `Male LCL`, ", ", `Male UCL`, "]"),
`Female LCL` = sprintf(`Female LCL`, fmt = "%#.2f"),
`Female UCL` = sprintf(`Female UCL`, fmt = "%#.2f"),
`Female OR` = sprintf(`Female OR`, fmt = "%#.2f"),
`Female CI` = paste0("[", `Female LCL`, ", ", `Female UCL`, "]")) %>%
select(Cohort, Adjusted, Predictor, `Male OR`, `Male CI`, `Male Pval`, `Female OR`, `Female CI`, `Female Pval`)
# Combining European results together.
Euro <- rbind(out1, out2)
# Preparing East Polynesian tophaceous disease model results.
tmp <- TophiModels %>%
mutate(Sex = case_when(str_detect(Cohort, "Male") ~ "Male",
str_detect(Cohort, "Female") ~ "Female"),
Cohort = str_remove(Cohort, " - Male| - Female"),
LCL = sprintf(LCL, fmt = "%#.2f"),
UCL = sprintf(UCL, fmt = "%#.2f"),
OR = sprintf(OR, fmt = "%#.2f"),
CI = paste0("[", LCL, ", ", UCL, "]"),
Adjusted = case_when(str_detect(Covariates, "DURATION") ~ TRUE,
TRUE ~ FALSE),
OR = case_when(`N case` < 20 | `N control` < 20 ~ NA_character_,
TRUE ~ OR),
CI = case_when(`N case` < 20 | `N control` < 20 ~ NA_character_,
TRUE ~ CI),
Pval = case_when(`N case` < 20 | `N control` < 20 ~ NA_real_,
TRUE ~ Pval)) %>%
select(Cohort, Sex, Adjusted, Predictor, OR, CI, Pval) %>%
filter(str_detect(Cohort, "East Polynesian"))
# Extracting male results.
tmp1 <- tmp %>%
filter(Sex == "Male") %>%
rename("Male OR" = OR,
"Male CI" = CI,
"Male Pval" = Pval) %>%
select(Cohort, Adjusted, Predictor, `Male OR`:`Male Pval`)
# Extracting female results.
tmp2 <- tmp %>%
filter(Sex == "Female") %>%
rename("Female OR" = OR,
"Female CI" = CI,
"Female Pval" = Pval) %>%
select(Cohort, Adjusted, Predictor, `Female OR`:`Female Pval`)
# Combining East Polynesian model results.
EastPoly <- left_join(tmp1, tmp2)
# Combining all tophaceous gout model results.
final <- rbind(Euro, EastPoly, WestPoly)
# Writing out results.
write_delim(final, file = here("Output/Tables/TophiMod.txt"), delim = "\t")
Urate vs Gout PRS
Comparing gout PRS to urate PRS for age at onset associations in men.
# Making list of PRS names.
prsnames <- c("PRS_std", "Urate_PRS_std")
# Preparing European dataset.
tmp <- all_pheno_prs %>%
filter(Pheno.Study != "UK Biobank",
!str_detect(COHORT_GOUT, "Polynesian"),
GOUT,
SEX == "Male",
!is.na(Urate_PRS)) %>%
mutate(PRS_std = PRS / sd(PRS, na.rm = T),
Urate_PRS_std = Urate_PRS / sd(Urate_PRS, na.rm = T))
# Making output vector.
out <- tibble("Outcome" = "test",
"Predictor" = "test",
"Beta_std" = 1,
"LCL_std" = 1,
"UCL_std" = 1,
"P" = 1,
"rsq" = 1,
"Beta_std2" = 1,
"Population" = "test",
"N" = 1)
# Modeling each standardized PRS against age at onset.
for(i in seq_along(prsnames)){
# Setting the full vector of variables to be the predictor and all covariates.
variables <- c(prsnames[i], colnames(tmp %>% select(Geno.PCVector1:Geno.PCVector10)))
variables2 <- c(str_remove(prsnames[i], "_std"), colnames(tmp %>% select(Geno.PCVector1:Geno.PCVector10)))
# Making formula object.
f <- formula(paste("AGE1ATK", paste(variables, collapse = " + "), sep = " ~ "))
f2 <- formula(paste("AGE1ATK", paste(variables2, collapse = " + "), sep = " ~ "))
# Running model.
assign(paste0("mod", i), lm(formula = f, data = tmp))
assign(paste0("bmod", i), lm.beta(lm(formula = f2, data = tmp)))
# Extracting model summaries.
out1 <- list("Age at onset (years)",
prsnames[i],
round(summary(get(paste0("mod", i)))$coefficients[[2, 1]], 3),
round(confint.default(get(paste0("mod", i)))[2,][[1]], 3),
round(confint.default(get(paste0("mod", i)))[2,][[2]], 3),
formatC(summary(get(paste0("mod", i)))$coefficients[[2, 4]], format = "e", digits = 2),
round(rsq.partial(get(paste0("mod", i)), adj = T)$partial.rsq[1], 3),
round(get(paste0("bmod", i))$standardized.coefficients[[2]], 3),
"European",
nrow(tmp))
# Adding results to output vector.
out <- rbind(out, out1)
}
# formatC(summary(get(paste0("mod", i)))$coefficients[[2, 4]], format = "e", digits = 2)
# Cleaning up output table.
out <- out %>%
slice(-1)
# Extracting East Polynesian individuals.
tmp <- all_pheno_prs %>%
filter(str_detect(COHORT_GOUT, "East Polynesian"),
GOUT,
SEX == "Male",
!is.na(Urate_PRS)) %>%
mutate(PRS_std = PRS / sd(PRS, na.rm = T),
Urate_PRS_std = Urate_PRS / sd(Urate_PRS, na.rm = T))
# Modeling each standardized PRS against age at onset.
for(i in seq_along(prsnames)){
# Setting the full vector of variables to be the predictor and all covariates.
variables <- c(prsnames[i], colnames(tmp %>% select(Geno.PCVector1:Geno.PCVector10_Oc)))
variables2 <- c(str_remove(prsnames[i], "_std"), colnames(tmp %>% select(Geno.PCVector1:Geno.PCVector10_Oc)))
# Making formula object.
f <- formula(paste("AGE1ATK", paste(variables, collapse = " + "), sep = " ~ "))
f2 <- formula(paste("AGE1ATK", paste(variables2, collapse = " + "), sep = " ~ "))
# Running model.
assign(paste0("mod", i), lm(f, data = tmp))
assign(paste0("bmod", i), lm.beta(lm(f2, data = tmp)))
# Extracting model summaries.
out1 <- list("Age at onset (years)",
prsnames[i],
round(summary(get(paste0("mod", i)))$coefficients[[2, 1]], 3),
round(confint.default(get(paste0("mod", i)))[2,][[1]], 3),
round(confint.default(get(paste0("mod", i)))[2,][[2]], 3),
formatC(summary(get(paste0("mod", i)))$coefficients[[2, 4]], format = "e", digits = 2),
round(rsq.partial(get(paste0("mod", i)), adj = T)$partial.rsq[1], 3),
round(get(paste0("bmod", i))$standardized.coefficients[[2]], 3),
"East Polynesian",
nrow(tmp))
# Adding results to output vector.
out <- rbind(out, out1)
}
# Extracting West Polynesian individuals.
tmp <- all_pheno_prs %>%
filter(str_detect(COHORT_GOUT, "West Polynesian"),
GOUT,
SEX == "Male",
!is.na(Urate_PRS)) %>%
mutate(PRS_std = PRS / sd(PRS, na.rm = T),
Urate_PRS_std = Urate_PRS / sd(Urate_PRS, na.rm = T))
# Modeling each standardized PRS against age at onset.
for(i in seq_along(prsnames)){
# Setting the full vector of variables to be the predictor and all covariates.
variables <- c(prsnames[i], colnames(tmp %>% select(Geno.PCVector1:Geno.PCVector10_Oc)))
variables2 <- c(str_remove(prsnames[i], "_std"), colnames(tmp %>% select(Geno.PCVector1:Geno.PCVector10_Oc)))
# Making formula object.
f <- formula(paste("AGE1ATK", paste(variables, collapse = " + "), sep = " ~ "))
f2 <- formula(paste("AGE1ATK", paste(variables2, collapse = " + "), sep = " ~ "))
# Running model.
assign(paste0("mod", i), lm(f, data = tmp))
assign(paste0("bmod", i), lm.beta(lm(f2, data = tmp)))
# Extracting model summaries.
out1 <- list("Age at onset (years)",
prsnames[i],
round(summary(get(paste0("mod", i)))$coefficients[[2, 1]], 3),
round(confint.default(get(paste0("mod", i)))[2,][[1]], 3),
round(confint.default(get(paste0("mod", i)))[2,][[2]], 3),
formatC(summary(get(paste0("mod", i)))$coefficients[[2, 4]], format = "e", digits = 2),
round(rsq.partial(get(paste0("mod", i)), adj = T)$partial.rsq[1], 3),
round(get(paste0("bmod", i))$standardized.coefficients[[2]], 3),
"West Polynesian",
nrow(tmp))
# Adding results to output vector.
out <- rbind(out, out1)
}
final <- out %>%
mutate(`95%-CI` = paste0("[", sprintf(LCL_std, fmt = "%#.2f"), ", ", sprintf(UCL_std, fmt = "%#.2f"), "]")) %>%
select(Outcome:Beta_std, `95%-CI`, P:N)
# Writing out results.
write_delim(final, file = here("Output/Tables/uratevsgoutPRS.txt"), delim = "\t")
PRS Weighting Analysis
Comparing Gout PRS, Urate PRS, and unweighted PRS results for age at onset association in men only. Using each PRS, standardized in pooled cohorts of each ancestry, assessing their relative effects. Also producing standardized effect sizes using the lm.beta package and partial R-squared values.
# Making list of PRS names.
prsnames <- c("PRS2_std", "Urate_PRS2_std", "Unweighted_PRS_std")
# Preparing European dataset.
tmp <- all_pheno_prs %>%
filter(Pheno.Study != "UK Biobank",
!str_detect(COHORT_GOUT, "Polynesian"),
GOUT,
SEX == "Male") %>%
mutate(PRS2_std = PRS2 / sd(PRS2, na.rm = T),
Urate_PRS2_std = Urate_PRS2 / sd(Urate_PRS2, na.rm = T),
Unweighted_PRS_std = Unweighted_PRS / sd(Unweighted_PRS, na.rm = T))
# Making output vector.
out <- tibble("Outcome" = "test",
"Predictor" = "test",
"Beta_std" = 1,
"LCL_std" = 1,
"UCL_std" = 1,
"P" = 1,
"rsq" = 1,
"Beta_std2" = 1,
"Population" = "test",
"N" = 1)
# Modeling each standardized PRS against age at onset.
for(i in seq_along(prsnames)){
# Setting the full vector of variables to be the predictor and all covariates.
variables <- c(prsnames[i], colnames(tmp %>% select(Geno.PCVector1:Geno.PCVector10)))
variables2 <- c(str_remove(prsnames[i], "_std"), colnames(tmp %>% select(Geno.PCVector1:Geno.PCVector10)))
# Making formula object.
f <- formula(paste("AGE1ATK", paste(variables, collapse = " + "), sep = " ~ "))
f2 <- formula(paste("AGE1ATK", paste(variables2, collapse = " + "), sep = " ~ "))
# Running model.
assign(paste0("mod", i), lm(f, data = tmp))
assign(paste0("bmod", i), lm.beta(lm(f2, data = tmp)))
# Extracting model summaries.
out1 <- list("Age at onset (years)",
prsnames[i],
round(summary(get(paste0("mod", i)))$coefficients[[2, 1]], 3),
round(confint.default(get(paste0("mod", i)))[2,][[1]], 3),
round(confint.default(get(paste0("mod", i)))[2,][[2]], 3),
formatC(summary(get(paste0("mod", i)))$coefficients[[2, 4]], format = "e", digits = 2),
round(rsq.partial(get(paste0("mod", i)), adj = T)$partial.rsq[1], 3),
round(get(paste0("bmod", i))$standardized.coefficients[[2]], 3),
"European",
nrow(tmp))
# Adding results to output vector.
out <- rbind(out, out1)
}
# Cleaning up output table.
out <- out %>%
slice(-1)
# Extracting East Polynesian individuals.
tmp <- all_pheno_prs %>%
filter(str_detect(COHORT_GOUT, "East Polynesian"),
GOUT,
SEX == "Male") %>%
mutate(PRS2_std = PRS2 / sd(PRS2, na.rm = T),
Urate_PRS2_std = Urate_PRS2 / sd(Urate_PRS2, na.rm = T),
Unweighted_PRS_std = Unweighted_PRS / sd(Unweighted_PRS, na.rm = T))
# Modeling each standardized PRS against age at onset.
for(i in seq_along(prsnames)){
# Setting the full vector of variables to be the predictor and all covariates.
variables <- c(prsnames[i], colnames(tmp %>% select(Geno.PCVector1:Geno.PCVector10_Oc)))
variables2 <- c(str_remove(prsnames[i], "_std"), colnames(tmp %>% select(Geno.PCVector1:Geno.PCVector10_Oc)))
# Making formula object.
f <- formula(paste("AGE1ATK", paste(variables, collapse = " + "), sep = " ~ "))
f2 <- formula(paste("AGE1ATK", paste(variables2, collapse = " + "), sep = " ~ "))
# Running model.
assign(paste0("mod", i), lm(f, data = tmp))
assign(paste0("bmod", i), lm.beta(lm(f2, data = tmp)))
# Extracting model summaries.
out1 <- list("Age at onset (years)",
prsnames[i],
round(summary(get(paste0("mod", i)))$coefficients[[2, 1]], 3),
round(confint.default(get(paste0("mod", i)))[2,][[1]], 3),
round(confint.default(get(paste0("mod", i)))[2,][[2]], 3),
formatC(summary(get(paste0("mod", i)))$coefficients[[2, 4]], format = "e", digits = 2),
round(rsq.partial(get(paste0("mod", i)), adj = T)$partial.rsq[1], 3),
round(get(paste0("bmod", i))$standardized.coefficients[[2]], 3),
"East Polynesian",
nrow(tmp))
# Adding results to output vector.
out <- rbind(out, out1)
}
# Extracting West Polynesian individuals.
tmp <- all_pheno_prs %>%
filter(str_detect(COHORT_GOUT, "West Polynesian"),
GOUT,
SEX == "Male") %>%
mutate(PRS2_std = PRS2 / sd(PRS2, na.rm = T),
Urate_PRS2_std = Urate_PRS2 / sd(Urate_PRS2, na.rm = T),
Unweighted_PRS_std = Unweighted_PRS / sd(Unweighted_PRS, na.rm = T))
# Modeling each standardized PRS against age at onset.
for(i in seq_along(prsnames)){
# Setting the full vector of variables to be the predictor and all covariates.
variables <- c(prsnames[i], colnames(tmp %>% select(Geno.PCVector1:Geno.PCVector10_Oc)))
variables2 <- c(str_remove(prsnames[i], "_std"), colnames(tmp %>% select(Geno.PCVector1:Geno.PCVector10_Oc)))
# Making formula object.
f <- formula(paste("AGE1ATK", paste(variables, collapse = " + "), sep = " ~ "))
f2 <- formula(paste("AGE1ATK", paste(variables2, collapse = " + "), sep = " ~ "))
# Running model.
assign(paste0("mod", i), lm(f, data = tmp))
assign(paste0("bmod", i), lm.beta(lm(f2, data = tmp)))
# Extracting model summaries.
out1 <- list("Age at onset (years)",
prsnames[i],
round(summary(get(paste0("mod", i)))$coefficients[[2, 1]], 3),
round(confint.default(get(paste0("mod", i)))[2,][[1]], 3),
round(confint.default(get(paste0("mod", i)))[2,][[2]], 3),
formatC(summary(get(paste0("mod", i)))$coefficients[[2, 4]], format = "e", digits = 2),
round(rsq.partial(get(paste0("mod", i)), adj = T)$partial.rsq[1], 3),
round(get(paste0("bmod", i))$standardized.coefficients[[2]], 3),
"West Polynesian",
nrow(tmp))
# Adding results to output vector.
out <- rbind(out, out1)
}
final <- out %>%
mutate(`95%-CI` = paste0("[", sprintf(LCL_std, fmt = "%#.2f"), ", ", sprintf(UCL_std, fmt = "%#.2f"), "]")) %>%
select(Outcome:Beta_std, `95%-CI`, P:N)
# Writing out results.
write_delim(final, file = here("Output/Tables/weightingscheme.txt"), delim = "\t")
For age at onset, this suggests that the gout PRS is consistently the best for associations, though it is only marginally better than the urate PRS. The unweighted PRS is consistently the worst.
Standardized Sex Differences + Interaction Term
# Preparing pooled European cohort for analysis.
tmp <- all_pheno_prs %>%
filter(Pheno.Study != "UK Biobank",
!str_detect(COHORT_GOUT, "Polynesian"),
GOUT) %>%
mutate(PRS_std = PRS / sd(PRS, na.rm = T))
# Running onset models in men and women separately.
mod1 <- lm(AGE1ATK ~ PRS_std, data = tmp %>% filter(SEX == "Male"))
mod2 <- lm(AGE1ATK ~ PRS_std, data = tmp %>% filter(SEX == "Female"))
# Analyzing results of models in men and women separately.
round(c(summary(mod1)$coefficients[[2, 1]], confint.default(mod1)[2,]), 3)
## 2.5 % 97.5 %
## -2.577 -3.041 -2.113
round(c(summary(mod2)$coefficients[[2, 1]], confint.default(mod2)[2,]), 3)
## 2.5 % 97.5 %
## -0.77 -2.24 0.70
# Running interaction model.
mod3 <- lm(AGE1ATK ~ PRS_std*SEX, data = tmp)
# Analyzing interaction model shows significant interaction effect (p = 0.02).
summary(mod3)
##
## Call:
## lm(formula = AGE1ATK ~ PRS_std * SEX, data = tmp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.777 -10.531 -0.432 9.742 43.023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.6583 1.5240 39.802 <2e-16 ***
## PRS_std -2.5770 0.2369 -10.876 <2e-16 ***
## SEXFemale 1.9654 4.9204 0.399 0.6896
## PRS_std:SEXFemale 1.8068 0.7786 2.320 0.0204 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.12 on 3906 degrees of freedom
## (106 observations deleted due to missingness)
## Multiple R-squared: 0.1037, Adjusted R-squared: 0.103
## F-statistic: 150.7 on 3 and 3906 DF, p-value: < 2.2e-16
# Plotting interaction.
ggplot(tmp, aes(x = PRS, y = AGE1ATK, color = SEX)) +
geom_point(shape = 1, show.legend = FALSE) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
facet_wrap(~ SEX, nrow = 2) +
labs(x = "Gout Polygenic Risk Score",
y = "Age at Onset (years)")

# Preparing pooled East Polynesian cohort for analysis.
tmp <- all_pheno_prs %>%
filter(str_detect(COHORT_GOUT, "East Polynesian"),
GOUT) %>%
mutate(PRS_std = PRS / sd(PRS, na.rm = T))
# Running onset models in men and women separately.
mod1 <- lm(AGE1ATK ~ PRS_std, data = tmp %>% filter(SEX == "Male"))
mod2 <- lm(AGE1ATK ~ PRS_std, data = tmp %>% filter(SEX == "Female"))
# Analyzing results of models in men and women separately.
round(c(summary(mod1)$coefficients[[2, 1]], confint.default(mod1)[2,]), 3)
## 2.5 % 97.5 %
## -3.068 -4.281 -1.855
round(c(summary(mod2)$coefficients[[2, 1]], confint.default(mod2)[2,]), 3)
## 2.5 % 97.5 %
## -1.670 -4.374 1.034
# Running interaction model.
mod3 <- lm(AGE1ATK ~ PRS_std*SEX, data = tmp)
# Analyzing interaction model shows no significant interaction effect (p = 0.32).
summary(mod3)
##
## Call:
## lm(formula = AGE1ATK ~ PRS_std * SEX, data = tmp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.535 -10.046 0.218 10.354 44.010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.0183 5.3995 11.856 < 2e-16 ***
## PRS_std -3.0680 0.6359 -4.825 1.75e-06 ***
## SEXFemale -1.0426 12.1305 -0.086 0.932
## PRS_std:SEXFemale 1.3980 1.4160 0.987 0.324
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.37 on 641 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.1167, Adjusted R-squared: 0.1125
## F-statistic: 28.22 on 3 and 641 DF, p-value: < 2.2e-16
# Preparing West Polynesian cohort for analysis.
tmp <- all_pheno_prs %>%
filter(str_detect(COHORT_GOUT, "West Polynesian"),
GOUT) %>%
mutate(PRS_std = PRS / sd(PRS, na.rm = T))
# Running onset models in men and women separately.
mod1 <- lm(AGE1ATK ~ PRS_std, data = tmp %>% filter(SEX == "Male"))
mod2 <- lm(AGE1ATK ~ PRS_std, data = tmp %>% filter(SEX == "Female"))
# Analyzing results of models in men and women separately.
round(c(summary(mod1)$coefficients[[2, 1]], confint.default(mod1)[2,]), 3)
## 2.5 % 97.5 %
## -2.01 -3.15 -0.87
round(c(summary(mod2)$coefficients[[2, 1]], confint.default(mod2)[2,]), 3)
## 2.5 % 97.5 %
## -2.909 -7.213 1.395
# Running interaction model.
mod3 <- lm(AGE1ATK ~ PRS_std*SEX, data = tmp)
# Analyzing interaction model shows no significant interaction effect (p = 0.63).
summary(mod3)
##
## Call:
## lm(formula = AGE1ATK ~ PRS_std * SEX, data = tmp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.924 -8.386 -1.478 7.862 41.537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.8659 4.5760 10.897 < 2e-16 ***
## PRS_std -2.0099 0.5972 -3.366 0.000828 ***
## SEXFemale 16.5232 14.4581 1.143 0.253705
## PRS_std:SEXFemale -0.8989 1.8898 -0.476 0.634534
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.16 on 459 degrees of freedom
## (27 observations deleted due to missingness)
## Multiple R-squared: 0.08018, Adjusted R-squared: 0.07416
## F-statistic: 13.34 on 3 and 459 DF, p-value: 2.324e-08
Relative Effect Sizes
Estimating the relative contribution of sex, age, gout genetic risk, BMI, and ancestry on age at onset.
# Preparing cohort.
tmp <- all_pheno_prs %>%
filter(!is.na(AGE1ATK),
!is.na(PRS),
GOUT) %>%
mutate(ANCESTRY = factor(case_when(str_detect(ANCESTRY_GOUT, "European") ~ "European",
str_detect(ANCESTRY_GOUT, "East Polynesian") ~ "East Polynesian",
str_detect(ANCESTRY_GOUT, "West Polynesian") ~ "West Polynesian"),
levels = c("European", "East Polynesian", "West Polynesian")))
# Running model.
mod <- lm(AGE1ATK ~ PRS + SEX + ANCESTRY + BMI, data = tmp)
# Extracting partial R-squared values.
partial_table <- tibble("Predictor" = rsq.partial(mod, adj = T)$variable,
"Partial R-squared" = rsq.partial(mod, adj = T)$partial.rsq,
"Percent variance explained" = round(rsq.partial(mod, adj = T)$partial.rsq * 100, 1))
# Printing table.
kable(partial_table)
Flares plot.
Flares vs PRS
all_pheno_prs %>%
filter(NUMATK > 1,
NUMATK < 53) %>%
mutate(Label = factor(case_when(COHORT_GOUT == "Aus/NZ - Gout" ~ "Aus/NZ",
COHORT_GOUT == "GlobalGout - Gout" ~ "GlobalGout",
str_detect(COHORT_GOUT, "Ardea") ~ as.character(COHORT_GOUT),
COHORT_GOUT == "East Polynesian - Gout" ~ "East Polynesian",
COHORT_GOUT == "East Polynesian - Gout - NP" ~ "East Polynesian - NP",
COHORT_GOUT == "West Polynesian - Gout" ~ "West Polynesian"),
levels = c("Aus/NZ", "GlobalGout", "Ardea - LASSO", "Ardea - CLEAR1", "Ardea - CLEAR2", "Ardea - CRYSTAL", "Ardea - LIGHT", "East Polynesian", "East Polynesian - NP", "West Polynesian"))) %>%
ggplot(aes(x = PRS, y = NUMATK, color = SEX)) +
geom_point(size = 0.5) +
facet_wrap(~ Label, nrow = 2, ncol = 5) +
labs(x = "Gout Polygenic Risk Score", y = "Number of Flares in Last Year (self-report)", color = "Sex")
ggsave(filename = here("Output/Plots/Flares.tiff"), width = 10, height = 5, dpi = 600)